PREFACE
LIST OF TABLES
LIST OF FIGURES

Chapter
I. FREQUENCY DISTRIBUTIONS
II. MEASURES OF CENTRAL TENDENCY
III. PERCENTILES AND NORMS
IV. MEASURES OF VARIABILITY
V. CORRELATION
VI. EVALUATION AND INTERPRETATION OF TESTS
VII. SUMMARY AND CONCLUSIONS

APPENDIX
I. PROBLEMS AND ANSWERS
II. SAMPLING PROCEDURES
III. SUGGESTED READINGS AND REFERENCES


LIST OF TABLES

Achievement Scores on an English Examination
Illustration of Interval Method
A Better Illustration of Interval Method
Calculation of the Mean
Raw Scores Expressed as Deviation Scores
Calculation of the Mode
Calculation of the Median, N is Odd
Calculation of the Median, N is Even
A Comparison of the Three Measures of Central Tendency
Computation of Percentiles
Percentiles Obtained on a Standardized Test
Two Distributions with Equal Means but Different Ranges
Two Distributions with Equal Ranges but Dissimilar Patterns of Dispersion
Calculation of Standard Deviation, Deviation Method
Calculation of Standard Deviation, Whole Score Method
Comparison of SD's for Two Distributions
Illustration of Perfect Positive Correlation
Illustration of Perfect Negative Correlation
Illustration of Rank-difference Method of Correlation
Illustration of Pearson product-moment r
Illustration of Split-Half Method of Correlation
Calculation of a Validity Coefficient


Elementary Statistical Methods 
for


EDUCATIONAL MEASUREMENT 


by 


ALBERT E. BARTZ 


Concordia College 


BURGESS PUBLISHING COMPANY 
426 South 6th Street - Minneapolis
Minnesota




Copyright 1958 
by 
Albert E. Bartz 


Library of Congress Catalog Card Number—58-12770 


Printed in the United States of America 


SCIENTIFIC BOOK AGENCY 
Post Box 239 
103, Netaji Subhas Road, Cal-l. 


PREFACE 


This manual for elementary statistics is designed to
serve two purposes: (1) to assist the student in understanding
the statistics necessary for constructing more adequate
teacher-made tests; and (2) to assist the student in the
interpretation of test manuals accompanying standardized tests.


The statistical methods explained and the statistical
computations presented are those which might be typical
for the average classroom situation, where the teacher's
primary concern is measuring and evaluating the progress
of students in one classroom group.


It is the belief of the author that many college instructors
in tests and measurements feel that no single text is
sufficient for their purpose; rather, they prefer to use several
texts as reference material. It is with this purpose in
mind that this manual is constructed. This workbook is
intended to give the student the necessary background material
that is prerequisite for a thorough understanding of test
practices. With proper usage much class time that is usually
expended in explanations and computations can be saved.


Each chapter has a number of illustrative examples with 
full explanations of the calculations involved. It will be to 
the student's advantage to follow the steps closely, and then 
go on to the problems for each chapter in the appendix. 


These examples and extra problems, plus the step-by-step
explanation and possible interpretations for each statistic,
should give the reader ample insight into each area of
measurement.


Also included in the appendix are a number of references
and suggested readings, and a section on sampling
procedures used in connection with standardized tests.


Because of the introductory nature of this manual,
some statistical procedures have been omitted. For example,
the calculation of the various statistics using grouped data
is not included. Most classroom situations involve a relatively
small number of scores, and grouping of data would
not be necessary. The interpretation of the various statistics,
whether calculated from raw data or grouped data, is
essentially the same. The student is referred to the references
in the appendix for the computational procedures based
on grouped data.


In order to make the illustrative examples easy to 
understand, the calculations are based on a small number of 
scores. In the classroom situation the number of scores 
will usually be greater than ten. As the number of measure- 
ments increases, the results will usually be more reliable. 


A note of thanks is due Dr. Hermann F. Buegel, 
Head, Department of Psychology at the University of North 
Dakota, and Dr. Alton Rogness, Head, Department of Education
at Concordia College, for many helpful comments and
suggestions.


LIST OF FIGURES

Histogram of English Scores
Theoretical Normal Curve of Distribution
Negative Skewness to the Left
Positive Skewness to the Right
Normal Curve Showing Mean and SD Distances
Distribution Curve Showing Mean and ...
Comparison of Two Standard Scores


Chapter I 
FREQUENCY DISTRIBUTIONS 


All data, whether from tests, questionnaires or experi- 
mental procedures must be analyzed in some way to be of 
any use to the people involved. Let us consider an example 
to show what we mean. 


In Table 1 below are 50 scores made by students on 
an achievement test in English. 


Table 1 


ACHIEVEMENT SCORES ON AN 
ENGLISH EXAMINATION 


69 "0 72 62 78 
ma 85 (2 73 91: 


71 61 85 82 82 
82 81 74 79 90 
66 88 82 86 83 
89 94 86 76 75 
81 79 93 76 80 
68 81 64 87 80 
95 $5 84 90 


88 97 86 68 


Notice how difficult it is to find any meaning in the
numbers in Table 1 above. With some effort we can find the
highest score, 97, or the lowest score, 61. We might want
to know how many people scored above 85 or how many scored
82 on this test. This would be difficult with such a table.


We could arrange the scores in order from highest to 
lowest and thus get some semblance of order to our data. 
This would immediately give us the highest and lowest score. 
However, we can organize our data still further by grouping 
our scores. 




GROUPED DATA 


To organize our data into groups, we need only to pick
some convenient class interval, such as 5 or 10, and tabulate
the number of scores that fall into each particular
interval. In checking the English scores, we find the highest
score is 97 and the lowest is 61. If we select a large class
interval such as 10, we shall have only four such intervals.
They are shown in Table 2.


Table 2
ILLUSTRATION OF INTERVAL METHOD

Interval    Tally                      Frequency
90-99       ///// ///                  8
80-89       ///// ///// ///// /////    20
70-79       ///// ///// ////           14
60-69       ///// ///                  8


Notice that each interval contains 10 possible scores, and not
9 as it would first appear. For instance, the interval 90-99
contains scores of 90, 91, 92, 93, 94, 95, 96, 97, 98 and 99, or
10 scores.


Frequency listed at the top of the table refers to the
number of scores falling in a certain class. We go through
any set of scores one by one and place a tally mark by the
interval in which each particular score lies. The frequency
column is simply the addition of these tally marks.


Unfortunately Table 2 has only four classes, and these four
divisions do not tell us much more about our scores. It would
be better to refine our table by using more and smaller
intervals. This is entirely an arbitrary affair and we might
conceivably have many classes with an interval of only one
score. For our purposes, we want enough intervals to give
us a good idea of where the scores fall with the least effort.


Let us choose eight classes with intervals of 5. This 
would give Table 3. 


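Table 3

A BETTER ILLUSTRATION OF THE
INTERVAL METHOD

Interval    Tally              Frequency
95-99       //                 2
90-94       ///// /            6
85-89       ///// ////         9
80-84       ///// ///// /      11
75-79       ///// //           7
70-74       ///// //           7
65-69       /////              5
60-64       ///                3

N = 50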


Notice that with this refinement we can get a better 
idea of the spread and concentration of our scores. We 
must remember in choosing our intervals that the first in- 
terval must contain the highest score and the last interval 
the lowest score. 


N, in Table 3, refers to the number of students in the
group measured, or to the number of measurements represented
by the group scores. We can check the accuracy of
our tabulation by comparing the total of the frequencies
with the number of measurements in our original
data. Naturally, this checks only the total number;
it does not tell us whether the tally marks have been
placed in the correct intervals.
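
The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frequency_distribution` and the twelve scores are hypothetical and are not the data of Table 1.

```python
from collections import Counter

def frequency_distribution(scores, width):
    """Tally scores into class intervals of the given width, highest interval first."""
    # Lower limit of the interval holding each score, e.g. 97 -> 95 when width is 5.
    tallies = Counter((score // width) * width for score in scores)
    lowest = (min(scores) // width) * width
    highest = (max(scores) // width) * width
    table = []
    for lower in range(highest, lowest - 1, -width):
        table.append((lower, lower + width - 1, tallies.get(lower, 0)))
    return table

# Hypothetical scores, chosen only for illustration.
scores = [97, 91, 88, 86, 85, 82, 82, 79, 76, 74, 68, 61]

table = frequency_distribution(scores, 5)
for lower, upper, frequency in table:
    print(f"{lower}-{upper}  {'/' * frequency:<6} {frequency}")

# The check described in the text: the frequencies must add up to N.
assert sum(frequency for _, _, frequency in table) == len(scores)
```

As in the text, the final check confirms only the total; it cannot show that each score was placed in the correct interval.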




We can check on the accuracy of the tallies for each
interval if we arrange our scores by order of magnitude,
and draw lines separating each interval. Then we can see
at a glance how many fall into a given interval. For example,
from our data of Table 1:


97
95
-----
94
93
92
91
90
90
-----


This checks with Table 3, in that there are two scores 
in the interval 95-99, 6 in 90-94, and similarly for the rest 
of the table. 


GRAPHING THE FREQUENCY DISTRIBUTION


By inspecting the frequency distribution of Table 3, 
we can find a certain degree of orderliness in our data. We 
can see that a few people made low scores and a few made 
high scores. The larger number of scores tended toward 
the middle of the distribution. However, it is difficult to 
picture the entire distribution as a whole. Also, certain 
irregularities in the distribution may escape a casual glance.
These difficulties can be overcome by constructing a graph 
of our scores. 


A method for graphing frequency distributions is the
histogram. Below is a histogram of the English scores
given in Table 3.


The steps in the construction of a histogram are as
follows:

1. Lay out an area or a piece of graph paper that
corresponds roughly to the size and proportions
of Figure 1. It is a good practice to have the
height of the graph about 3/4 of the length. The
horizontal line, called the x-axis, is drawn
long enough to include all of the scores plus a
little unused space at each end. Label this line
(Scores in this case) and indicate the lower limit
of each interval by the proper scores.

2. At the left end of the horizontal line draw a vertical
line called the y-axis. Divide the y-axis
into units so that the greatest frequency will not
quite reach the top of the graph. Number these
units and label this axis (Frequency in this case).

3. Now the histogram can be completed by simply
drawing lines parallel to the x-axis at the height
represented by the frequency for each interval.


Figure 1
Histogram of English Scores
(x-axis: Scores, 60 to 100 in intervals of 5; y-axis: Frequency)




4. Give the histogram a title, either above or below 
the figure. The title should be a clear statement 
of what the histogram represents. 
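
The construction steps above can be imitated in plain text. The sketch below is an assumption-laden illustration, not a reproduction of Figure 1: it draws the bars sideways (one row per interval) rather than upright, and both the function name `text_histogram` and the frequencies are hypothetical.

```python
def text_histogram(frequencies, title):
    """Return the lines of a sideways histogram: one bar of '#' per class interval."""
    lines = [title]  # step 4: every histogram carries a clear title
    for (lower, upper), frequency in frequencies:
        # The label marks the interval; the bar length is the frequency.
        lines.append(f"{lower:>3}-{upper:<3} | {'#' * frequency} ({frequency})")
    return lines

# Hypothetical frequencies for eight intervals of 5, chosen only for illustration.
frequencies = [((60, 64), 1), ((65, 69), 2), ((70, 74), 3), ((75, 79), 5),
               ((80, 84), 6), ((85, 89), 4), ((90, 94), 2), ((95, 99), 1)]

lines = text_histogram(frequencies, "Histogram of hypothetical scores")
for line in lines:
    print(line)
```

Even in this crude form the "hump" near the middle intervals and the thinning out toward both ends are visible at a glance.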


When we examine the histogram of Figure 1, we see that 
it is not symmetrical, and that the shape is irregular. How- 
ever, there is a tendency for many scores to fall towards 
the center of the distribution, with a few scores on either 
side of the "hump". We see, also, that there is a wide vari- 
ation in the scores. 


THE NORMAL CURVE OF DISTRIBUTION 


When we have only 50 scores, we must expect a lack of
symmetry in the distribution. If we measured 500 students,
some of the irregularities would smooth out. And again,
if we had a thousand students a still more symmetrical
distribution would occur.


We will assume that if an infinite number of scores 
were obtained we would have a perfectly symmetrical dis- 
tribution, or curve. This curve is called the theoretical 
normal curve of distribution. It is illustrated in Figure 2. 


This curve is never obtained in practice, although there 
may be many close approximations to it. However, it has 
some properties which will be of use to us in a later chapter.


SKEWNESS 


Sometimes the scores tend to group themselves more 
heavily toward one end of the distribution than the other. 
When we plot a curve of these uneven distributions we say 
that they are skewed.


If the scores are heavily concentrated toward the upper 
end of the distribution, we say that the curve is negatively 
skewed, as in Figure 3. 




A graph of the frequency distribution is called a histo- 
gram. This enables us to examine our distribution more 
thoroughly than with only a frequency distribution. This 
pictorial representation gives us the information about the 
spread and the concentration of scores at a glance. 


The normal curve of distribution is a theoretical model
never achieved in practice, but closely approximated.


If the distribution tends to cluster towards one end or
the other the curves are said to be skewed. The two types
are:

a. negative skewness (to the left), and
b. positive skewness (to the right).


Chapter II 
MEASURES OF CENTRAL TENDENCY 


The frequency distribution and the histogram are devices 
by which we can organize and obtain some meaning from our 
data. But an analysis of these is limited, because they are 
not specific enough for our purposes. We would prefer additional
statistical techniques that would express the nature
of our data in more economical terms. One of the main
tasks of statistics is to reduce large masses of data to easily 
understandable, quantitative terms. 


For example, we may want to know how the class scored 
as a whole on some test. To do this, we want measures 
which will tell us where the scores are concentrated. In this 
chapter we will discuss these measures, usually referred to 
as measures of central tendency. They are the mean, median, 
and the mode. 


THE MEAN 


The mean is commonly referred to as the arithmetic 
average. Let us consider Table 4 in which the calculation 
of the mean is shown. 


Notice that the mean (M) is simply the sum of all scores
divided by the number of scores. The symbol "X" refers to
a score. The symbol "Σ" means "summation" or "sum of".
Thus, ΣX means the sum of all of the X's, or in other words,
the sum of the scores. N, of course, is the number of cases.
By summing the scores, you obtain a value for ΣX. In this
example, the sum is 50. N here is 10.

Thus the Mean is 50 divided by 10, or 5.




Table 4
CALCULATION OF THE MEAN

Individual    X
A             2
B             7
C             8
D             6
E             3
F             6
G             2
H             3
I             8
J             5

N = 10    ΣX = 50

M = ΣX / N = 50 / 10 = 5


There are different methods for obtaining the mean. 
For example, there is a special method for calculating the 
mean of data that has been grouped into classes, such as we 
have in Table 3. However, since we will be usually working 
with data that is not too extensive, we will use the method 
given in Table 4. This method is more accurate than work- 
ing with data that is grouped in classes. For the student 
who is interested in methods of finding means for grouped 
data, the references cited in the appendix will suffice. 


What is the mean? What does our M = 5 in the example
in Table 4 tell us about our data? The mean is the typical
performance on a test or task by the group as a whole.
When we speak of the average score on an exam, we mean
the representative value of this group. The M = 5 in
Table 4 states that the score of 5 on this test is the
representative value of this group as a whole.


Another use of the mean is in comparison with what a 
particular individual did on an exam. Did he score above 
the mean or below it? Is he above average or below average? 
We can answer these questions in terms of his deviation or 
distance from the mean value. 




Instead of reporting an individual's score as in Table 4,
we could also report his score in terms of its deviation from
the Mean. For example, Individual A's deviation score
would be -3, since his score is 2 and the Mean is 5. Likewise,
Individual B's score would be +2, since his score is
7, and the Mean again is 5. Deviation scores tell us how
far each individual's score is from the mean, and in which
direction. (A positive deviation score shows that his score
is above the mean while a negative deviation shows he was
below the mean.)


As was explained previously, the score of any particular
individual is denoted as X. An individual's deviation score
is denoted as "x". Therefore as a general formula, we may
write x = X - M.


This simply states what we mentioned in a previous
paragraph. An individual's deviation score (x) is obtained
by subtracting the mean (M) from his raw score. Raw
scores refer to the scores that we obtained directly from
the test.


For Individual A

x = X - M
x = 2 - 5
x = -3


For Individual B

x = X - M
x = 7 - 5
x = +2


Let us repeat Table 4 and express each score as a devi- 
ation from the mean. 


Table 5
RAW SCORES EXPRESSED AS DEVIATION SCORES

Individual    X     x
A             2    -3
B             7     2
C             8     3
D             6     1
E             3    -2
F             6     1
G             2    -3
H             3    -2
I             8     3
J             5     0

N = 10    ΣX = 50    Σx = 0


Notice that the sum of all of the deviations from the
mean (Σx) is zero. This is one of the characteristics of
the arithmetical average. It is calculated in such a way
that it is directly in the center of these deviations, making
the algebraic sum of these deviations zero.
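
The calculation of the mean and of the deviation scores can be sketched as follows. The scores are those of Table 5; the score of 5 for Individual J is an assumption, being the value implied by ΣX = 50, and the variable names are ours.

```python
# Raw scores (X) for the ten individuals of Table 5.
# J = 5 is assumed here because it is the value that makes the sum 50.
scores = {"A": 2, "B": 7, "C": 8, "D": 6, "E": 3,
          "F": 6, "G": 2, "H": 3, "I": 8, "J": 5}

n = len(scores)                 # N, the number of cases
total = sum(scores.values())    # the sum of the X's
mean = total / n                # M = (sum of X) / N

# Deviation score for each individual: x = X - M.
deviations = {person: x - mean for person, x in scores.items()}

print(f"N = {n}, sum of X = {total}, M = {mean}")
for person, x in deviations.items():
    print(f"{person}: X = {scores[person]}, x = {x:+.0f}")
print("sum of x =", sum(deviations.values()))
```

The last line prints a sum of zero, which is the property of the mean noted in the text.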


THE MODE 


The French expression "a la mode" literally means the 
vogue, or in style. That is exactly what the mode is. It 
is the score that is made the most frequently, or in other 
words seems "to be in style". The mode is also classified 
as a measure of central tendency. A glance at a frequency 
distribution shows the grouping about a central measure, 
and the mode is the highest point in the hump, or the most 
frequent score (see Figure 1). The mode is easily obtained 
by inspection, but it is the crudest measure of central
tendency. The mode is not as valuable in the analysis of data
as either the median or mean. Consider the following table.




Table 6
CALCULATION OF THE MODE


Principal Mode = 21


Secondary Mode = 19


The Mode here is 21 since four individuals made that
score. However, three individuals scored 19, and so it is
necessary to distinguish between the Principal Mode and
the Secondary Mode. If 21 and 19 had had the same
frequency, we would have had a bi-modal distribution. The
chief value of the mode lies in the fact that it is easily
obtained by inspection and is useful in locating points of
concentration of like scores in a distribution.
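
Finding the mode by inspection amounts to counting how often each score occurs. A minimal sketch follows; the scores are hypothetical, chosen only so that 21 occurs four times and 19 three times as in the discussion above, and the function name `modes` is ours.

```python
from collections import Counter

def modes(scores):
    """Return (score, frequency) pairs ordered from most to least frequent."""
    return Counter(scores).most_common()

# Hypothetical scores: 21 is made four times and 19 three times.
scores = [23, 22, 21, 21, 21, 21, 20, 19, 19, 19, 18, 17]

ranked = modes(scores)
principal, secondary = ranked[0], ranked[1]
print("Principal Mode:", principal[0], "made by", principal[1], "individuals")
print("Secondary Mode:", secondary[0], "made by", secondary[1], "individuals")

# If two or more scores share the highest frequency, the distribution is bi-modal.
highest = ranked[0][1]
tied = [score for score, frequency in ranked if frequency == highest]
print("Bi-modal:", len(tied) > 1)
```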


THE MEDIAN 


Another measure of central tendency is the median. 
The median is simply the middle score in the distribution. 
We will see in the next chapter that the median is also the 
50th percentile. This follows logically, since if the median 
score in a group is the central score, 50% fall below the 
median. This will be discussed fully in the next chapter. 


The median can be computed easily if a set of scores
is arranged according to magnitude. If N is an odd number,
such as in Table 7, the median can be computed by calculating

\[ \frac{N + 1}{2} \]

In the example,

\[ \frac{N + 1}{2} = \frac{9 + 1}{2} = 5 \]

We then count up five scores from the bottom. In
Table 7 the fifth score is 14, and this is the median for this
distribution.


Table 7
CALCULATION OF THE MEDIAN, N IS ODD

Med = (N + 1) / 2 = (9 + 1) / 2 = 5th score from the bottom

5th score is 14


If N is an even number such as in Table 8, the median
is computed by calculating

\[ \frac{N + 1}{2} = \frac{10 + 1}{2} = 5.5 \]


Counting up from the bottom, the fifth score is 10. 
However, our median is half-way between the fifth and sixth 
score. Since the next score is 12, the median is half-way 
between 10 and 12, or 11. 


Table 8
CALCULATION OF THE MEDIAN, N IS EVEN

Med = (N + 1) / 2 = (10 + 1) / 2 = 5.5th score from the bottom


As with the mean, the median score does not actually
have to be a score made by an individual. The median satisfies
the definition as being the central score in the distribution,
and with an even number of cases this value may
not actually exist in the distribution. However, it is a
measure of central tendency in that it shows the central
point in the distribution. The chief use of the median, of
course, is denoting the center of the distribution. It is not
affected by atypical values, since it does not actually use the
value of the score, but only indicates the center of the
distribution.
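
The counting rule for the median can be sketched as below. The two sets of scores are hypothetical; they are chosen only so that the results agree with the values quoted in the text (a fifth score of 14 when N is 9, and a fifth and sixth score of 10 and 12 when N is 10). The function name `median` is ours.

```python
def median(scores):
    """Median by the counting rule: the (N + 1) / 2-th score from the bottom."""
    ordered = sorted(scores)
    position = (len(ordered) + 1) / 2      # 1-based position; ends in .5 when N is even
    lower = ordered[int(position) - 1]     # score at the whole-number position
    if position == int(position):
        return lower                       # N is odd: the middle score itself
    upper = ordered[int(position)]         # N is even: the next score up
    return (lower + upper) / 2             # half-way between the two middle scores

odd_scores = [8, 10, 11, 13, 14, 16, 17, 19, 20]      # N = 9, hypothetical
even_scores = [4, 6, 7, 9, 10, 12, 13, 15, 16, 18]    # N = 10, hypothetical

print(median(odd_scores))    # 5th score from the bottom
print(median(even_scores))   # half-way between the 5th and 6th scores
```

Note that the second result, 11, is not a score made by any individual, which is the point made in the paragraph above.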


For example, suppose we wish to compute the measure
of central tendency of the annual incomes in a certain village.
Now let us further suppose that in this community
there is one millionaire and the rest are coal miners.
This is highly exaggerated, but it will show the value of the
median in some cases, as against the mean and the mode.
Shall we use the mean, median, or mode? In quoting the
incomes for each citizen, only by coincidence would two
incomes be exactly the same. Thus the mode could not be
used. How about the mean? Since the mean is computed
on the basis of the actual value of each score, the millionaire's
income would unduly influence it, and
our measure of central tendency would be much too high.
However, since the median is the center of our distribution
of incomes, the median value would be the most accurate
measure of central tendency for the annual income of the
villagers.
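
The village example can be tried with a few numbers. The incomes below are entirely hypothetical (nine miners and one millionaire); the sketch uses the `mean` and `median` functions of Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual incomes: nine coal miners and one millionaire.
incomes = [2800, 2950, 3100, 3250, 3400, 3550, 3700, 3850, 4000, 1_000_000]

print("Mean income:  ", mean(incomes))
print("Median income:", median(incomes))

# The mean is pulled above every miner's income by the one atypical value.
print("Mean exceeds every miner's income:", mean(incomes) > max(incomes[:-1]))
```

With these figures the mean is more than twenty-five times the largest miner's income, while the median falls among the miners' incomes.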

COMPARISON OF THE MEAN,
MEDIAN, AND MODE

Now that we have discussed the different measures of
central tendency, when might one be preferred over the
other two?

Let us first consider a single distribution and compare
the three measures from it.

Table 9
COMPARISON OF THE THREE MEASURES
OF CENTRAL TENDENCY

ΣX = 156
N = 18




The Mode: the most frequent score is 10.

The Median:

\[ \frac{N + 1}{2} = \frac{18 + 1}{2} = 9.5 \]

The 9.5th score from the end is midway
between 9 and 9.

The median is 9.

The Mean:

\[ M = \frac{\Sigma X}{N} = \frac{156}{18} = 8.7 \text{ (approx.)} \]

Summary
Mode = 10
Median = 9
Mean = 8.7


In this distribution the mode gives the highest value 
and the mean is the lowest. The low value of the mean is 
due to the preponderance of low scores. 


SUMMARY 


The three measures of central tendency give us concise 
information about the nature of the distribution. 


The arithmetic mean takes into account the magnitude 
of each score. Therefore, the mean should be used when- 
ever we want all the scores to determine the measure of 
central tendency. However, there are times when the mean 
is not the most representative value, as was shown in the 
example of village incomes given with the median.


The median should be used when there are a few atypical 
values in the distribution, because the median shows the ex- 
act center of the distribution and is not affected by the size 
of the scores. 




The mode can be used as a preliminary inspection device,
because a quick glance at the frequency distribution will
show us the most frequently made score. It is useful in
pointing out concentrations of like scores, but it is a very
crude measure of central tendency and is very easily
influenced by other factors.




Chapter III 
PERCENTILES AND NORMS 


In the preceding chapter we found that the median is
that point in a score distribution below which lie 50% of the
scores. In exactly the same way, we may calculate points
below which lie 20%, 40%, 68% or any per cent of the scores.
These points are called percentiles and are usually denoted
by the symbol Pp where P is a percentile and p is the per
cent of cases below a given value. P20 (read "a percentile
of 20" or the "20th percentile"), for example, is the point
below which lie 20% of the scores. P68 is the point below
which lie 68% of the scores. It is obvious that the median
can also be referred to as P50.


Let us consider the calculation of Pp. The general
formula for finding Pp in a set of scores ranked in magnitude
is:

\[ P_p = \text{the } (N + 1)\left(\frac{p}{100}\right)\text{th score from the bottom} \]


The usual method for finding percentiles is with grouped
data where N is large. The method illustrated above is applicable
to ungrouped data only. The calculation of percentiles
for grouped data where N is large can be found in any of
the references listed in the appendix. Our main purpose
here is to understand how percentiles are used in test
construction and interpretation.
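
The general formula for Pp with ungrouped data can be sketched as follows. Two assumptions are ours and not the text's: the nineteen scores are hypothetical, and when the position is fractional (as with the 13.6th score) the sketch interpolates between the two neighboring scores.

```python
def percentile(scores, p):
    """Point below which p per cent of the scores lie, by the (N + 1)(p / 100) rule."""
    ordered = sorted(scores)
    position = (len(ordered) + 1) * p / 100    # 1-based position from the bottom
    if position < 1 or position > len(ordered):
        raise ValueError("position falls outside the ranked scores")
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    upper = ordered[whole]
    # Assumption: interpolate between neighboring scores for fractional positions.
    return lower + fraction * (upper - lower)

# Hypothetical set of N = 19 scores, so that N + 1 = 20.
scores = [26, 28, 29, 30, 31, 32, 33, 34, 34, 35,
          35, 36, 36, 36, 37, 38, 39, 40, 42]

print("P50 =", percentile(scores, 50))   # 10th score from the bottom
print("P10 =", percentile(scores, 10))   # 2nd score from the bottom
print("P68 =", percentile(scores, 68))   # 13.6th score from the bottom
```

Because the 13th and 14th scores in this hypothetical set are both 36, the interpolated value for P68 is simply 36.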


When we speak of P10 = 28, we mean that 10% of the
scores were below 28. Similarly, for P50 = 35, we mean
that 50% of the scores fell below 35. We may use percentiles
to interpret an individual's score on a test. If P90 = 75,
and the individual's score is 76 or above, we know that in
regard to the rest of the class, 90% of the individuals scored
below him, placing him in the top ten per cent. Similarly,
if P10 = 25, and the individual's score is 22, we know that
he is in the bottom 10 per cent of his class.



Table 10
COMPUTATION OF PERCENTILES

P50 = (N + 1)(50/100) = 20(1/2) = 10th score from the bottom = 35

P10 = (N + 1)(10/100) = 20(1/10) = 2nd score from the bottom = 28

P68 = (N + 1)(68/100) = 20(68/100) = 13.6th score from the bottom = 36

NORMS

Just how is this information of use to us in the interpretation
of tests? When a standardized test is placed on
the market, a set of norms is included in the test packet.
These norms are computed on the basis of some sample used
by the test authors to "standardize" their tests. An example
may clarify this point.

Let us suppose that a certain national test is intended
for use in determining college entrance requirements. When
the test is constructed, a random sample is chosen whose
scores on the test will be compiled in tables called norms.
In this case, a random sample of a thousand or more college
freshmen will be selected from different colleges throughout
the nation. Their scores on the test are then arranged in
a frequency distribution, and the percentiles corresponding
to each score position are computed. These scores and
their corresponding percentiles constitute the norms which
accompany this standardized test. The norms may look like
this:


Table 11
PERCENTILES OBTAINED ON A STANDARDIZED TEST


Raw Score    Percentile
122          99
121          99
120          98
119          98
...          ...
84           51
83           51
82


When this test is administered to any other group of
college freshmen, each student's score can be
compared with the scores on the standardized norm table and
his corresponding percentile rank obtained. Likewise, whole
classes of college freshmen can be compared to the 1000
students in the original sample.
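
Using a norm table is a matter of looking up a raw score and reading off its percentile. The sketch below holds only the rows of Table 11 that are legible; the function name `percentile_rank` is ours, and a score that is not listed simply returns `None`.

```python
# The legible rows of Table 11: raw score -> percentile.
norms = {122: 99, 121: 99, 120: 98, 119: 98, 84: 51, 83: 51}

def percentile_rank(raw_score, norm_table):
    """Look up the percentile printed beside a raw score; None if it is not listed."""
    return norm_table.get(raw_score)

print(percentile_rank(120, norms))   # a freshman with a raw score of 120
print(percentile_rank(83, norms))    # a freshman with a raw score of 83
```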




In selecting a standardized test to be used for classroom
purposes, it is important, therefore, to select a test whose 
standardization group resembles the group which you are 
going to test. 


Percentiles can also be used to compare an individual's 
performance on two or more tests. This is valuable since 
it is impossible to compare raw scores directly. Knowing 
that an individual scored 27 on a reading test, 54 on an 
arithmetic achievement test, and 128 on a culture test, does 
not help us in estimating the individual's ability. However, 
if we know that his three scores corresponded to the 76th, 
64th, and 83rd percentile on these tests, we can tell some- 
thing about his performance in regard to the rest of the 
examinees.


SUMMARY 


Percentiles are useful in comparing one individual with 
the rest of the group. They are also useful in comparing
how one individual performs on two or more tests.


Percentiles based on standardized tests for some particular
sample are called norms. Their chief value lies in
comparing an individual or entire group with the sample
used to standardize this test.




Chapter IV
MEASURES OF VARIABILITY


MEANING OF VARIABILITY 


A friend may ask you about the gasoline mileage that 
you are getting on your car. You might reply that it runs 
around 16 miles to the gallon. This does not tell your friend 
that city driving in heavy traffic would only give about 13 
miles to the gallon, while out on the highway you might get 
19 or 20 miles to the gallon. We could say that while 16 
might be the best single figure to represent your distribution 
of gas mileage, it tells you nothing about the variation that 
occurs from different types of driving. 


One characteristic of any set of scores is the variability 
of the distribution. In a previous chapter we have had occa- 
sion to examine different distributions for measures of cen- 
tral tendency or the representative value of the group as a 
whole. One characteristic we noticed was that not everybody 
made the same score. Rather there was a tendency for some 
people to score below the mean, and some above. This fluctuation
of scores about the mean value is called variability.


We frequently use measures of variability to compare 
two sets of test scores. The mean alone is not enough, 
since the means of the two sets could be identical, yet the
distributions could be very dissimilar. 


In Table 12, the means of the two groups are identical, 
but notice how dissimilar the two distributions are. Obviously,
we need more than just the mean to know the characteristics
of our distribution of scores.




Table 12


TWO DISTRIBUTIONS WITH EQUAL MEANS
BUT DIFFERENT RANGES


Distribution I        Distribution II

N = 10


THE RANGE 


One of the simplest and most straightforward measures
of variation about a mean value is the Range. The Range,
as we mentioned previously, is the distance between the two
extreme scores. For example, if the highest score in a
distribution is 97 and the lowest is 65, the Range is
97 - 65 = 32. This value tells us something about the
dispersion of our distribution. The larger the Range the larger
is the dispersion of scores from the mean value. In Table
12, the means are identical but the Range for Distribution I
was 10 - 4 = 6 and for Distribution II, 12 - 1 = 11. Obviously,
the scores in Distribution II were scattered more widely
about the mean than in Distribution I. Although the Range
is a good preliminary method for determining the dispersion
of scores, it is limited in that it does not tell us the pattern
of this dispersion. Consider Table 13.




Table 13 


TWO DISTRIBUTIONS WITH EQUAL RANGES BUT 
DISSIMILAR PATTERNS OF DISPERSION 


Distribution I          Distribution II
      X                       X
     17                      17
     15                      17
     15                      17
     15                      17
     15                      16
     14                      16
     14                      10
     14                      10
     13                      10
     10                      10
ΣX = 142                ΣX = 140
N = 10                  N = 10
M1 = 142/10 = 14.2      M2 = 140/10 = 14.0
R1 = 17 - 10 = 7        R2 = 17 - 10 = 7


In both distributions in Table 13, the ranges are equal 
(7) and means are almost equal (14.2 and 14.0). But look 
how different the two distributions are. Distribution I has 
a high score of 17 and a low of 10. Notice that the rest of 
the values are tightly grouped around the center of the dis- 
tribution. Distribution II has the same extreme values, 17 
and 10, but notice that the rest of the values are tightly 
grouped towards the ends of the distributions with a gap in 
the middle. It is evident that the range does not tell us all 
we need to know about the variability of a set of scores. 
We need to have some measure of variability that will take 
into account the pattern of the distribution. Such a measure 
is the standard deviation. 
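The comparison in Table 13 can be checked with a short Python sketch. The scores are transcribed from the table; the helper names `mean` and `score_range` are illustrative choices, not part of the original text.

```python
# Scores transcribed from Table 13.
dist_1 = [17, 15, 15, 15, 15, 14, 14, 14, 13, 10]
dist_2 = [17, 17, 17, 17, 16, 16, 10, 10, 10, 10]


def mean(scores):
    """Arithmetic mean: sum of the scores divided by N."""
    return sum(scores) / len(scores)


def score_range(scores):
    """Range: distance between the two extreme scores."""
    return max(scores) - min(scores)


for name, scores in (("I", dist_1), ("II", dist_2)):
    print(f"Distribution {name}: M = {mean(scores):.1f}, R = {score_range(scores)}")
```

Both distributions report a Range of 7 and nearly equal means, which is exactly why the Range alone cannot separate them.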


THE STANDARD DEVIATION


As we mentioned in a previous chapter, each score in
a distribution varies from the mean by a greater or lesser
amount (except, of course, when the score is the same as
the mean value). It would seem obvious, then, that we
might measure the amount of variability about the mean by
using the deviations of each score from the mean. In
Table 5 we did exactly that by finding the deviation value
(x) for each score.


A measure of variability could be the average value of
these deviations from the mean. However, you saw that
the sum of these deviations (Σx) about the mean is always
equal to zero. This makes it arithmetically impossible to
work with. The problem, then, is to find some way to get
rid of the negative signs in front of the deviations. We can
get rid of all of the negative numbers by squaring the deviations.
You will remember from elementary algebra that
when two numbers of the same sign are multiplied together,
the product is always positive. So when we square each of
the deviations, we get positive numbers whether our original
deviations are negative or positive. Then, since we have
squared the deviations, all we have to do is take the square
root of their average to get back to the original units of measurement.
This measure -- the square root of the average of the squared
deviations, taken from the arithmetic mean of the distribution
-- is called the standard deviation. The formula that
represents the standard deviation is


$$SD = \sqrt{\frac{\sum x^2}{N}}$$

where SD is the standard deviation,
Σx² refers to the sum of all squared deviations
from the mean, and
N is the number of scores.


Table 14 shows the computation of the standard devia- 
tion using the above formula. 
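As a minimal sketch of the deviation method, the Python below follows the formula step by step. Because the score columns of Table 14 did not survive in this copy, the example reuses Distribution II of Table 13 as its data; the function name `sd_deviation_method` is an illustrative assumption.

```python
import math


def sd_deviation_method(scores):
    """SD by the deviation method: sqrt(sum(x^2) / N), where x = X - M."""
    n = len(scores)
    m = sum(scores) / n                      # the Mean
    deviations = [x - m for x in scores]     # x for each score
    squared = [d * d for d in deviations]    # x squared
    return math.sqrt(sum(squared) / n)       # average, then square root


# Distribution II of Table 13 (used here because Table 14 is illegible).
scores = [17, 17, 17, 17, 16, 16, 10, 10, 10, 10]
m = sum(scores) / len(scores)

print(sum(x - m for x in scores))            # deviations sum to zero
print(round(sd_deviation_method(scores), 1))  # about 3.3
```

Note that the divisor is N, as in the formula above, not N - 1.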


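Table 14

CALCULATION OF STANDARD DEVIATION
(DEVIATION METHOD)

(The score columns of this table are illegible in the source; only the first two entries of the x² column, 25 and 16, and the final computation survive.)

$$SD = \sqrt{\frac{\sum x^2}{N}} = \sqrt{12.8} = 3.6 \text{ (app.)}$$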

The above example shows the calculation of SD by use
of the deviation method. However, it is laborious to compute
since it is necessary to compute the Mean, subtract
the Mean from each score, square these deviations, average
the squared deviations, and extract the square root. A simpler
formula has been developed which will give the same results
with much less work. The formula is

$$SD = \sqrt{\frac{\sum X^2}{N} - M^2}$$

where ΣX² refers to the sum of squared scores,
M is the mean,
and N is the number of scores.

This formula is mathematically the same, and it is
much easier to compute. The method using this formula
is called the whole score method, because the original whole
scores are used instead of the deviations. Notice that in
the above formula it is ΣX², which is the sum of the squares
of the whole scores, whereas in the previous formula it was
Σx², the sum of the squares of the deviations.




Table 15

CALCULATION OF THE STANDARD DEVIATION,
WHOLE SCORE METHOD

  X      X²
 12     144
 12     144
 14     196
 17     289
 14     196
 13     169
 16     256
 15     225
 15     225
 15     225

ΣX = 143     ΣX² = 2069

N = 10
M = 143/10 = 14.3
M² = (14.3)² = 204.5

Substitute:

$$SD = \sqrt{\frac{\sum X^2}{N} - M^2} = \sqrt{\frac{2069}{10} - 204.5} = \sqrt{206.9 - 204.5} = \sqrt{2.4} = 1.55 \text{ (app.)}$$


The steps illustrated in Table 15 are: 


1. Place the scores in the column X. 


2. Square each score and enter the results in the
column X².


3. Sum each of the columns to obtain ΣX and ΣX².


4. Compute the Mean. 




5. Square the Mean.


6. Substitute in the formula for ΣX², N, and M².


7. Perform the necessary calculations under the
square root sign.


8. Extract the square root. 
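The eight steps above can be sketched in Python using the scores of Table 15. The function names are illustrative assumptions; the second function repeats the deviation method only to confirm that the two formulas agree.

```python
import math

# Step 1: the scores from column X of Table 15.
scores = [12, 12, 14, 17, 14, 13, 16, 15, 15, 15]


def sd_whole_score(scores):
    """SD by the whole score method: sqrt(sum(X^2)/N - M^2)."""
    n = len(scores)
    sum_x = sum(scores)                      # step 3: sum of X
    sum_x2 = sum(x * x for x in scores)      # steps 2-3: sum of X squared
    mean = sum_x / n                         # step 4
    under_root = sum_x2 / n - mean ** 2      # steps 5-7
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(under_root, 0.0))   # step 8


def sd_deviation(scores):
    """SD by the deviation method, for comparison."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / n)


print(round(sd_whole_score(scores), 2))  # about 1.55
print(round(sd_deviation(scores), 2))    # same value by the other method
```

Carrying M² to more decimal places than the table does (204.49 rather than 204.5) changes the result only slightly.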


Just what does a SD value tell us about the nature of
our distribution? It tells us how much the scores
in a distribution deviate from the mean. If the value of the
SD is small, there is little variability, and the majority
of the scores are tightly clustered about the mean. If the
SD is large, the scores are more widely scattered above
and below the mean. We can use the SD for comparing
two groups to see how they differ in variability. For example,
let us repeat Table 13, in which we illustrated two
groups: one with scores tightly grouped and another
in which the scores were more widely scattered.


Distribution II has a SD almost twice as large as 
Distribution I. This is due to the different patterns in the 
distribution. The means are almost equal and the ranges 
are equal, but the distributions are patterned differently 
and the SD gives us this information. 
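For readers who prefer to let a library do the arithmetic, the comparison can be sketched with Python's standard `statistics` module. This is an illustration rather than the book's own procedure: `statistics.pstdev` divides by N, as the formulas in this chapter do, whereas `statistics.stdev` divides by N - 1 and would give slightly larger values.

```python
import statistics

# The two distributions of Table 13.
dist_1 = [17, 15, 15, 15, 15, 14, 14, 14, 13, 10]
dist_2 = [17, 17, 17, 17, 16, 16, 10, 10, 10, 10]

sd_1 = statistics.pstdev(dist_1)  # population SD (divisor N)
sd_2 = statistics.pstdev(dist_2)

print(round(sd_1, 1))         # about 1.7
print(round(sd_2, 1))         # about 3.3
print(round(sd_2 / sd_1, 2))  # ratio of the two SD's, a little under 2
```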


The importance of the SD cannot be over-emphasized. 
Because the concept of the SD is basic to the construction 
and interpretation of tests the student should familiarize 
himself with the preceding discussion and calculations. 


Table 16

COMPARISON OF THE SD'S FOR TWO DISTRIBUTIONS

Distribution I

  X      X²
 17     289
 15     225
 15     225
 15     225
 15     225
 14     196
 14     196
 14     196
 13     169
 10     100

ΣX = 142     ΣX² = 2046

N = 10
M = 142/10 = 14.2
M² = 201.6

$$SD = \sqrt{\frac{\sum X^2}{N} - M^2} = \sqrt{\frac{2046}{10} - 201.6} = \sqrt{204.6 - 201.6} = 1.7 \text{ (app.)}$$

Distribution II

  X      X²
 17     289
 17     289
 17     289
 17     289
 16     256
 16     256
 10     100
 10     100
 10     100
 10     100

ΣX = 140     ΣX² = 2068

N = 10
M = 140/10 = 14
M² = 196

$$SD = \sqrt{\frac{\sum X^2}{N} - M^2} = \sqrt{\frac{2068}{10} - 196} = \sqrt{206.8 - 196} = \sqrt{10.8} = 3.3 \text{ (app.)}$$


THE SD AND THE NORMAL CURVE 


As previously stated, a normal curve of distribution 
is never obtained in practice. However, many times in our 
testing, we find our scores closely approximating a bell 
shaped distribution. If we assume that our curve of distribu- 
tion is similar to the theoretical normal curve, our SD is 
even more useful than before. 


Figure 5
Normal Curve Showing Mean and SD Distances


In the normal curve shown in Figure 5, the Mean is 
erected from the base line and this vertical line divides the 
distribution into two equal parts. In other words, 50 per 
cent of the scores lie to the left of the Mean and 50 per cent 
to the right. Vertical lines are also erected from the base 
line corresponding to the different SD units. The mathematics
used in placing these perpendiculars in the normal
curve are beyond the scope of this booklet and will not be
explained here.


The SD units are so placed that the area (number of
scores) between -1 and +1 SD units from the Mean corresponds
to approximately 68 per cent of the total area. Similarly,
approximately 95 per cent of the area lies between -2 and +2 SD
units from the Mean, and more than 99 per cent of the cases
lie between -3 and +3 SD units from the Mean.
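Although the mathematics are outside the scope of the booklet, these percentages can be checked numerically. The sketch below uses the standard relation between the normal curve and the error function, area within ±k SD = erf(k/√2); the function name `area_within` is an illustrative assumption.

```python
import math


def area_within(k):
    """Proportion of a normal distribution lying within k SD units of the Mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"-{k} to +{k} SD: {100 * area_within(k):.1f} per cent")
```

The three printed values are close to 68.3, 95.4, and 99.7 per cent.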



Of what value is this information? We can treat our 
distribution (if we assume that our distribution is approxi- 
mately normal) in the same way. We can construct a dis- 
tribution curve for our set of scores and label it in the same 
way as in Figure 5.


As an example, let us take the following information 
from a set of scores and construct a distribution curve: 


Range of scores: 20-80
Mean = 50
SD = 10

SD units:   -3   -2   -1    M   +1   +2   +3
Scores:     20   30   40   50   60   70   80

Figure 6
Distribution Curve Showing Mean and SD Units


From the information presented in the distribution
curve we can see at a glance how our scores are spread
out and how many scores fall in any particular SD unit.
This may be of value in assigning letter grades, A, B, C,
D, E, and F on a class test.




The SD can also be used to compare an individual's
score on two or more tests. Suppose an individual makes
a score of 75 on Test I and 70 on Test II. At a glance we
might say that he did better on Test I than on Test II.
However, we need more information before we can make a valid
comparison.


Suppose that we have the following information: 


                      Test I    Test II
Individual's Score      75        70
M                       60        60
SD                      10         5


We cannot compare raw scores from the two different
distributions. We must convert the raw scores to Standard
Scores. To do this, we subtract the Mean from the score
and divide by the SD. The formula is:


$$\text{Standard Score} = \frac{X - M}{SD}$$

where X is an individual's score,
M is the Mean, and
SD is the standard deviation.


Test I:   $\frac{75 - 60}{10} = 1.5$

Test II:  $\frac{70 - 60}{5} = 2.0$


Notice that the individual's score on Test II gives a
higher Standard Score. We can get an idea of why this
happens when we look at the size of the SD for the different
tests. For Test II, the SD is only 5 while for Test I it is
10. We know that the smaller SD indicates that the distribution
is more tightly grouped about the mean than in Test I.
Accordingly, his score on Test II, in terms of SD units, places him
further to the right of the mean than his Standard Score on Test I.
The graph is shown in Figure 7.
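The conversion to Standard Scores can be sketched in a few lines of Python. The function name `standard_score` is an illustrative assumption; the figures are those given for Test I and Test II.

```python
def standard_score(x, mean, sd):
    """Standard Score: distance of a raw score from the Mean, in SD units."""
    return (x - mean) / sd


test_1 = standard_score(75, 60, 10)
test_2 = standard_score(70, 60, 5)

print(test_1)  # 1.5
print(test_2)  # 2.0
```

Although the raw score on Test I is higher, the Standard Score on Test II is higher because that distribution is more tightly grouped.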


Figure 7
Comparison of Two Standard Scores


SUMMARY 


The SD enables us to find out just how much our scores
are scattered in the distribution and can be used to compare
an individual's scores on two or more tests.


We must remember that the interpretation of the SD given
here is based on the theoretical normal curve. If our distribution
reasonably approximates this curve, we may believe that our
statistics are very nearly correct. The more that our distribution
is skewed, the more likely the chance that our
figures are in error. There are ways by which the scores
in a skewed distribution can be transformed into a normal
curve. The reader is referred to one of the references
listed in the appendix. However, for practical purposes,
the more closely a distribution of scores approximates the
normal curve, the less will be the probability of errors.




Chapter V 
CORRELATION 


In our previous discussion we have referred to only one 
frequency distribution, or in other words one score for each 
individual. We also compared two frequency distributions 
in which we were concerned with the comparison of two 
groups of different individuals. Previously we have not
discussed ways in which we can compare scores on two different
tests. If a group takes two different tests, how are their
scores on one test related to the scores on the other test?


The method that we use to define this relationship is 
called correlation. An example might clarify the meaning 
of correlation.


Table 17

AN ILLUSTRATION OF PERFECT
POSITIVE CORRELATION

                Test A          Test B
Individual     X    Rank       X    Rank

(The entries of this table are illegible in the source.)

where X refers to a score on Test A
or Test B, and Rank refers to the individual's
standing on the test.


Notice that the relationship is perfect. Individual A 
scored the highest on both tests. Individual E scored the
lowest. When we assign the rank for each score for every
individual we see that they agree perfectly on both tests.




This is perfect positive correlation. This relationship, or 
correlation coefficient, is denoted by +1.0 which means 
that there is a perfect positive correlation. 


We also might have the reverse situation. 


Table 18 


AN ILLUSTRATION OF PERFECT 
NEGATIVE CORRELATION 


                Test A
Individual     X    Rank

               17
               15
               14
               13
               12

(The Individual, Rank, and Test B entries are illegible in the source.)

Observe in Table 18 that there is a relationship here
also. However, in this case, Individual A scored highest on 
one test, but lowest on the other. The ranks for the other 
individuals are similarly reversed. This is still perfect 
correlation, but in a negative direction. This type of per- 
fect negative correlation is denoted by the correlation co- 
efficient -1.0. 



Now if there were no relationship, Individual A might 
rank first on Test I, but third on Test II. Individual C 
might rank third on Test I, but fifth on Test II, and so on. 
In other words, there would be no pattern of relationship 
shown in the data. The correlation coefficient would be
0.0, or no relationship.


What type of data would give us these different correla- 
tion coefficients? Suppose that we desire to know the scho- 
lastic ability of a number of football players. Furthermore, 
we would like to know the relationship of scholastic ability 
to football prowess. Are football players good students, 
poor students, or average? 




It would be necessary to rank each player on his ability
in football. Then we would give an achievement test to each
member and obtain a score for his scholastic ability. Now,
what relationship is there between athletic ability and
scholastic ability? If the best football player received the
highest score in the achievement test, the next football player
the next highest score, and so on, we would have a perfect
relationship, a correlation of +1.0. This would substantiate
a theory that athletic ability and scholastic ability go hand
in hand.


If the best football player received the lowest score on 
the achievement test, the second best football player re- 
ceived the second lowest score, etc. for the rest of the 
distribution, we would have a perfect negative relationship, 
a correlation of -1.0. This would substantiate a theory 
that the best football players are the worst students and the
best students are the worst football players.


If there were no relationship between football ability
and scholastic ability the correlation would be 0.0. This
would substantiate a theory that athletic and scholastic
ability are not related.


The correlation coefficient can vary from perfect positive,
+1.0, to perfect negative, -1.0. The extreme values
are rarely obtained in practice, but high values, for example
+.89 or +.69, are common. As the coefficient ranges from
0.0 to +1.0, our relationship becomes greater, until we
reach +1.0, where our relationship is perfect. As the coefficient
ranges from 0 to -1.0, the same holds true: the
correlation becomes greater, but in a negative direction.


This is valuable when you want to compare two tests 
taken by the same individuals. If we say that the correla- 
tion coefficient is .89, we know that the relationship between 
the scores on the two tests is good. Similarly, if the cor- 
relation coefficient is low, say .15, we know that the re- 
lationship is poor. 




If the correlation coefficient is high for the two tests,
we know that the tests must be similar with regard to what
they are measuring in the individuals. In the next chapter
we will further discuss the interpretation of the correlation
coefficients with regard to tests and test construction, but
in this chapter we will be primarily concerned with the
computational procedures.


RANK-DIFFERENCE METHOD
(SPEARMAN RHO)


One way in which we can determine the relationship
between two sets of scores is by computing a correlation
coefficient on the basis of the ranks of the individuals. This
method is called the Rank-difference Method. We need to
rank each one of the individuals on both tests, obtain the
difference in rank, square the differences, sum these squared
differences, and substitute in the following formula.


$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$$

where ρ is the correlation coefficient,
ΣD² is the sum of the squared differences
in the two ranks, and
N is the number of individuals.




Table 19

ILLUSTRATION OF RANK-DIFFERENCE METHOD

(The score and rank columns for Individuals A through J are only partly legible in the source; the totals and the computation survive.)

N = 10
N² = 100
ΣD² = 19.5

$$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6(19.5)}{10(100 - 1)} = 1 - \frac{117}{990} = 1 - .118$$

ρ = .882 or .88


The steps in the computation of the Rank-difference
Method are:


1. Divide the sheet into seven columns labeled:
Individual; X₁ (score on Test I); R₁ (rank on Test I);
X₂ (score on Test II); R₂ (rank on Test II); D (difference
in ranks); and D² (difference squared).


2. Place the score for each individual in the appropriate
columns.


3. Rank these scores on the basis of the highest score
as Rank 1, and so forth for both tests.


4. Obtain the difference by subtracting R₂ from R₁
and place in the D column.


5. Square each one of these differences and place in
the D² column.


6. Sum the D² column to obtain ΣD² and substitute in
the formula.


7. The denominator is always the quantity N(N² - 1).
To find this quantity, square N, subtract 1 from N²,
and multiply by N.


8. Perform the division indicated by 6ΣD² / N(N² - 1) and
subtract the quotient from 1. This figure is the
correlation coefficient.
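The steps above can be sketched in Python. Both function names are illustrative assumptions. Since the rank columns of Table 19 are not fully legible in this copy, the table's result is checked from its surviving totals (ΣD² = 19.5, N = 10), and the rank lists used for the other two checks are made-up examples.

```python
def rho_from_sum(sum_d2, n):
    """Spearman rho from the sum of squared rank differences and N."""
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


def spearman_rho(ranks_1, ranks_2):
    """Spearman rho from two lists of ranks for the same individuals."""
    if len(ranks_1) != len(ranks_2):
        raise ValueError("both tests must rank the same individuals")
    # Steps 4-6: difference in ranks, squared, then summed.
    sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))
    # Steps 7-8: divide by N(N^2 - 1) and subtract from 1.
    return rho_from_sum(sum_d2, len(ranks_1))


# Totals that survive from Table 19.
print(round(rho_from_sum(19.5, 10), 2))               # about .88

# Hypothetical ranks: identical order, then fully reversed order.
print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```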


It will be observed from the formula ρ = 1 - 6ΣD² / N(N² - 1)
that the size of the rank differences directly affects the size
of the coefficient. If the correlation is perfect, that is, if
the ranks are the same for both tests, each rank difference
in the D column will equal zero, and the corresponding D²
column will also be zero. This makes the numerator of the
fraction zero and thus the quotient is zero. As a result the
correlation is +1.0. As the relationship between the scores
becomes poorer, the rank differences increase and the fraction
to be subtracted from 1 becomes larger. In the case
where there is a negative correlation, rank differences are
very large and the fraction is greater than one. The
subtraction then yields a negative coefficient.




In Test I in Table 19 Individuals D and E both made the
same score. When scores are tied, it would be incorrect to
give the two scores different ranks. We must give the two
tied scores the same rank, one that falls midway between the
ranks of the preceding score and the following score. For example, if
in the rank series 7, 8, 9, and 10, there is a tie for the 8th
and 9th ranks, we first determine the mid-point between 7
and 10, which is 8.5. Thus these two tied scores are given
the rank of 8.5 in the series. If three ranks are tied we use
the same method. For example, if in the series 14, 15, 16,
17, and 18, there is a tie for the 15th, 16th and 17th ranks,
we again determine the mid-point between the preceding
and following ranks, 14 and 18, which is 16. Thus each of the
three tied scores is given the rank of 16 in the series.
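The treatment of ties can be sketched as a small ranking function. It gives each tied score the average of the positions the tied group occupies, which is the same value as the mid-point rule described above. The function name `average_ranks` and the example scores are illustrative assumptions.

```python
def average_ranks(scores):
    """Rank scores with the highest as Rank 1; tied scores share the mid-point rank."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        # Advance j past every score tied with ordered[i].
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # The tied group occupies positions i+1 through j; use their average.
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[s] for s in scores]


# Hypothetical scores: a two-way tie, then a three-way tie.
print(average_ranks([20, 18, 18, 15]))  # [1.0, 2.5, 2.5, 4.0]
print(average_ranks([9, 7, 7, 7, 5]))   # [1.0, 3.0, 3.0, 3.0, 5.0]
```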


The coefficient that we obtained in Table 19 is +.88.
This is indicative of a high positive correlation in the two 
sets of scores. This correlation is based on the rank dif- 
ferences in these two sets of scores. 


In the next section we will consider a method of obtaining 
a correlation coefficient which does not use the ranks of the 
scores but utilizes the actual sizes of the scores in the
computation. This is called the Pearson product-moment 
Method, and the coefficient is called the Pearson r. 


CALCULATION OF PEARSON 
PRODUCT-MOMENT r 


The formula for the correlation coefficient for this
method is

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$

where r is the correlation coefficient,
Σxy is the sum of the products of x and y for every
individual,
Σx² is the sum of the squared deviations from the
mean in Test X, and
Σy² is the sum of the squared deviations from the
mean in Test Y.




Let us look at this formula. It is necessary to find the
deviations of each score in Test X from the Mean of Test X,
and the deviations of each score in Test Y from the Mean of
Test Y. These deviation scores are denoted in the usual
fashion as x and y. It is also necessary to find the x² and
y² values for each deviation score. You will remember
that we did this in the computation of the SD when we found
x² for only one set of scores. Since we have two tests we
also find the y² values for each deviation in Test Y. These
x² and y² values are summed to obtain Σx² and Σy². At
this point a new term is introduced, Σxy. To obtain this,
we multiply each individual's x by his y and sum the
products. For example, Individual B's x-score of 4 would
be multiplied by his y-score of 2, which gives his xy value
of 8 in the xy column. This method is shown in Table 20.


Table 20

ILLUSTRATION OF PEARSON PRODUCT-MOMENT r

Individual    X     Y     x     y     x²    y²    xy

(The rows for Individuals A through J are illegible in the source; only the column totals survive.)

Total        130    90     0     0    38    36    26




SUMMARY

Mx = 13
My = 9
Σx² = 38
Σy² = 36
Σxy = 26

Substitution:

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{26}{\sqrt{(38)(36)}} = \frac{26}{\sqrt{1368}} = \frac{26}{37}$$

r = .70 (app.)


The steps in the calculation of the Pearson r are: 


1. Divide the sheet into eight columns labeled:
Individual; X (score on Test X); Y (score on
Test Y); x (deviation of each X from the mean of
Test X); y (deviation of each Y from the mean of Test Y);
x² (square of each deviation in Test X); y² (square
of each deviation in Test Y); and xy (product of each
deviation score of Test X and Test Y).


2. Enter each individual's score for both tests in the
appropriate columns. Sum these columns and find
Mx (mean of Test X) and My (mean of Test Y).


3. Subtract Mx from each score in Test X and enter in
column x. Subtract My from each score in Test Y
and enter in column y.




4. Square each value of x and enter in the x² column.
Similarly, square each value of y and enter in the
y² column. Sum these two columns to obtain Σx²
and Σy².


5. Multiply each individual's x value by his y value and
enter in the xy column. Sum this column to obtain
Σxy. (NOTE: When multiplying x by y the algebraic
signs must be taken into account. In Table 20, all
xy values are positive since a negative value in x
is multiplied by a negative value of y. In many
cases, the signs are different for corresponding
values of x and y, and this negative product is
entered in the column xy. The total of the xy
column (Σxy) is the algebraic sum, so the sum of
the negative numbers in this column must be subtracted
from the sum of the positive numbers to
obtain Σxy.)


6. Prepare a summary table listing the values Mx,
My, Σx², Σy², and Σxy.


7. Substitute in the formula

$$r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$$


To obtain the numerator of the above formula it is only 
necessary to substitute the value of Zxy from the summary 
table. However, for the denominator, one must first sub- 
stitute the values for zx? and zy? and multiply. Then the 
square root of this product is extracted and the necessary 
division is performed to arrive at the correlation coefficient. 
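The procedure can be sketched in Python. The function names are illustrative assumptions. Because the individual rows of Table 20 are illegible in this copy, the table's result is checked from its surviving totals, and the small score lists are made-up data used only to show the full column-by-column computation.

```python
import math


def r_from_sums(sum_xy, sum_x2, sum_y2):
    """Pearson r from the three summary totals."""
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


def pearson_r(xs, ys):
    """Pearson product-moment r computed by the deviation method."""
    if len(xs) != len(ys):
        raise ValueError("each individual needs a score on both tests")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n           # step 2: the two means
    x = [a - mx for a in xs]                    # step 3: deviations in Test X
    y = [b - my for b in ys]                    #         deviations in Test Y
    sum_x2 = sum(d * d for d in x)              # step 4
    sum_y2 = sum(d * d for d in y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # step 5: algebraic sum
    return r_from_sums(sum_xy, sum_x2, sum_y2)  # step 7


# Totals that survive from Table 20.
print(round(r_from_sums(26, 38, 36), 2))  # about .70

# Hypothetical scores in which Y is exactly twice X.
print(pearson_r([1, 2, 3], [2, 4, 6]))    # 1.0
```

If every score on one test is identical, the denominator is zero and r is undefined; the sketch makes no attempt to handle that case.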


This correlation coefficient is interpreted in the same 
way as for the Rank-difference Method. The coefficient 
varies from +1.0 to -1.0, and high values indicate close
relationship. 


COMPARISON OF THE TWO METHODS 


We have discussed two types of correlation that yield 
measures of relationship. Which one should be used? 


As you have noted, the Pearson product-moment 
Method utilizes the actual size of the scores, while the 
Rank-difference Method deals only with the location of the 
scores in a series, and makes no allowances for the size
of the gaps between scores. Individuals who score 75, 74, 
and 50 on a test might receive ranks of 1, 2, and 3. Notice 
that there is only one score interval between 75 and 74, and 
the ranks would be 1 and 2. However, there is a wide gap 
between 74 and 50, yet the score 50 would receive a rank 
of 3. Much accuracy may be lost in converting scores to 
ranks, especially when the scores are tied. The Pearson 
product-moment r is to be preferred for greater accuracy. 
However, the Rank-difference Method has its uses. Since
it is easily computed, it is a handy preliminary device to
check for the presence of a relationship. It is also useful
in discovering relationships in criteria that cannot be
dimensionalized. For example, we may want to find out if
any relationship exists between students' scores on an achievement
test and their excellence in extra-curricular activities. We
can hardly assign scores to extra-curricular activities, but
we could have several instructors judge the students on their
performances in extra-curricular activities and rank each
individual. We could then obtain a correlation between
these ranks and the ranks of their scores on the achievement
test. In situations like this, the Rank-difference
Method has its greatest value.


SUMMARY 


The coefficient of correlation is the measure of rela- 
tionship of two sets of data. A coefficient of +1.0 denotes 
a perfect relationship in a positive direction, while a co- 
efficient of -1.0 denotes perfect relationship in a negative 
direction. The greater the size of the coefficient, whether
positive or negative, the greater the relationship that exists.


The measure of relationship can be calculated by two
different methods: the coefficient r for the Pearson product-moment
Method, and the coefficient ρ for the Rank-difference
Method. The Pearson method is superior because it is based
on the size of the scores while the Rank-difference Method
is based only on the ranks of the scores. However, the
Rank-difference Method is easily calculated and useful as
a preliminary device.




Chapter VI 
EVALUATION AND 
INTERPRETATION OF TESTS 


Finally we have come to the point where we can devote 
a full discussion to the construction and interpretation of 
tests. The previous chapters have given us the necessary 
information and tools to discuss tests intelligently. In this 
chapter we will consider two important characteristics of 
an adequate test: reliability and validity. 


RELIABILITY 


The first question we should ask about any test is, "Is it
reliable?" By reliability, we mean that a test tests consistently
and accurately. If we give a test one time and it gives
us certain results, and at a second administration gives us
a totally different result, which result are we to believe? This
test is not reliable. That is, it is not testing consistently
and accurately. If a test is not reliable, we do not know
whether Individual A who scored in the middle of the distribution
actually belongs at the top or the bottom. In other words,
there would not have been much point in giving the test in
the first place. If the object of a test is to separate individuals
on the basis of some certain trait, we do not know if
we have accomplished any separation if our test is not
reliable.


An important consideration, then, is determining whether 
or not a test is reliable. How might we do this? The most
straightforward way would be to use a method of correlation.
If we obtain a high positive correlation between two admin- 
istrations of the same test to the same people, our test must 
be testing consistently, since the high coefficient means a 
high degree of relationship. If several individuals scored 
high on one administration and low on the other, the rela- 
tionship would be lower, and the reliability less. When a 




test is reliable, the scores made by the members of the 
group will be consistent from one administration to the 
next. A reliable test, therefore, is relatively free of 
chance errors of measurement and scores earned on it are 
stable and trustworthy. 


There are three ways in which we can determine the 
reliability of a test. They are the Test-Retest Method, 
Alternate Forms Method and Split-Half Method. 


TEST-RETEST METHOD 
The simplest and most straightforward method for
determining reliability would be to give the test twice to the
same individuals. We would then use one of the methods of
correlation described in the preceding chapter to determine
the relationship of the two administrations. As was mentioned
before, the coefficient of correlation yielded by this
method would show how closely the individuals' performances
on the two administrations agree. If the relationship is high, the
test is testing consistently and is reliable.


However, there are various objections to this method. 
If a test is repeated within a short time interval, many 
individuals would be certain to recall answers that they had 
given previously, and thus spend their time on the difficult 
material. This would increase some scores and the cor- 
relation coefficient would not be an accurate estimate of the 
relationship. The type of test would, of course, affect the 
amount of transfer from one administration to the next. 


If the test is repeated after a long time interval, growth 
and maturity (especially if the subjects are children) would 
affect the performance on the second administration. Cer- 
tain experiences by different individuals during this interval 
might influence their performance also. Because of the dif- 
ficulty of controlling these varying conditions, the Test- 
Retest Method is used less frequently than the other two. 




ALTERNATE FORMS METHOD 


An obvious way to eliminate the objections to the Test- 
Retest Method would be to give a different test at the second 
administration. Then there would be no memory factor to 
increase the scores of some individuals. This different 
test for the second administration must be very similar to 
the first test if our reliability coefficient is to be meaningful.
Let us denote these alternate forms as Form A and
Form B.


However, it would be necessary to construct an Alter- 
nate Form B for every test for which we wanted to deter- 
mine the reliability coefficient. Many times this is not 
feasible because of the amount of time and work involved. 
When accurate alternate forms are constructed, the
reliability coefficient yielded is relatively accurate.


SPLIT-HALF METHOD 


Perhaps the easiest method for determining the relia- 
bility of a test is by the Split-Half Method. In this method 
the test is broken down into two parts, and a correlation 
coefficient is obtained between the two parts of the test. 
The most often used method for dividing the test into two
parts is the odd and even method. In this way, each individual
has two scores, a score on the odd numbered items
in a test, and a score on the even numbered items. It is necessary
to go through the answer sheet for each individual and tabulate
the number right on the odd numbered items and the
number right on the even numbered items. These scores
are tabulated in the X and Y column of the Pearson product- 
moment table such as in Table 21. In essence, we are 
treating the two halves as separate tests. The tabulation 
for one individual might be: 


Individual   Total    Odd-numbered     Even-numbered      X       Y
             Score    items correct    items correct    (odd)   (even)

    A          32          17               15            17      15
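
The tabulation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_half_scores and the 40-item answer sheet are hypothetical, invented so that the totals agree with Individual A (32 right in all, 17 on the odd items, 15 on the even items).

```python
def split_half_scores(item_correct):
    """Return (odd_score, even_score) for one answer sheet.

    item_correct is a list of 1 (right) or 0 (wrong), one entry per item.
    Item 1 sits at list index 0, so the odd-numbered items are the
    entries at even list indices.
    """
    odd_score = sum(item_correct[0::2])   # items 1, 3, 5, ...
    even_score = sum(item_correct[1::2])  # items 2, 4, 6, ...
    return odd_score, even_score


# Hypothetical answer sheet for Individual A (invented for illustration).
sheet_a = [1, 1] * 15 + [1, 0] * 2 + [0, 0] * 3

odd_a, even_a = split_half_scores(sheet_a)
print("total:", odd_a + even_a, "X (odd):", odd_a, "Y (even):", even_a)
```

Repeating this for every answer sheet would give the X and Y columns needed for the correlation table.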




We would do this for each individual in the test until 
we had a table similar to the one of Table 20. The coef- 
ficient yielded by this method would be the relationship of 
the two halves of the test. However, we want a coefficient 
that gives us the reliability of the entire test. It is neces-
sary to substitute in the following formula (the Spearman-Brown
Prophecy Formula):

$$ r_t = \frac{2 r_{oe}}{1 + r_{oe}} $$

where
$r_t$ = reliability coefficient of the entire test
$r_{oe}$ = reliability coefficient of one-half of the test


Let us consider an example using this method for finding 
the reliability of a test. 


Table 21 


ILLUSTRATION OF SPLIT HALF METHOD 
OF CORRELATION 


Individual   X(odd)   Y(even)     x     y    x²    y²    xy
    A          12       10        1     0     1     0     0
    B          10        8       -1    -2     1     4     2
    C           9       11       -2     1     4     1    -2
    D          14       11        3     1     9     1     3
    E          13       10        2     0     4     0     0
    F           8        8       -3    -2     9     4     6
    G          12       11        1     1     1     1     1
    H          11       10        0     0     0     0     0
    I          11       11        0     1     0     1     0
    J          10       10       -1     0     1     0     0
             ----     ----      ---   ---   ---   ---   ---
              110      100        0     0    30    12    10




SUMMARY

M_X = 11        Σx² = 30
M_Y = 10        Σy² = 12
                Σxy = 10

$$ r_{oe} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} = \frac{10}{\sqrt{(30)(12)}} = \frac{10}{\sqrt{360}} = .53 \text{ (app.)} $$


The steps in determining the reliability of the split-half 
test are the same as in the general procedure for determin- 
ing the Pearson product-moment r. However, the coeffi-
cient yielded is based on one-half of the test. To find the
reliability of the entire test, it is necessary to substitute
in the Spearman-Brown prophecy formula.


The reliability of the entire test is given by:

$$ r_t = \frac{2 r_{oe}}{1 + r_{oe}} = \frac{2(.53)}{1 + .53} = \frac{1.06}{1.53} = .69 \text{ (app.)} $$

This gives the reliability coefficient for the entire test.
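
The whole split-half computation may also be checked by machine. The sketch below is a minimal illustration, not a prescribed procedure: the function names pearson_r and spearman_brown are our own, and the odd and even scores are those of Table 21, with Individual J's pair (10, 10) inferred from the column totals of 110 and 100.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment r computed from deviation scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_x2 * sum_y2)


def spearman_brown(r_half):
    """Step a half-test coefficient up to the full-length test."""
    return 2 * r_half / (1 + r_half)


# Odd and even scores for Individuals A through J (Table 21).
odd = [12, 10, 9, 14, 13, 8, 12, 11, 11, 10]
even = [10, 8, 11, 11, 10, 8, 11, 10, 11, 10]

r_oe = pearson_r(odd, even)
r_t = spearman_brown(r_oe)
print("half-test r:", round(r_oe, 2), "full-test r:", round(r_t, 2))
```

Because the Spearman-Brown step always raises a positive half-test coefficient, the full-test figure printed is larger than the half-test figure.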




We have seen in the last few paragraphs how we can
estimate the reliability using the Split-Half Method of cor-
relation. What is the rationale for this method? The Test-
Retest and the Alternate Forms methods were straightfor-
ward in that, if reliability is present, an individual will make
comparable scores in regard to the rest of the group on
both administrations. Is there similar reasoning behind
the Split-Half Method?


When we separate an individual's total score into odd 
and even correct, we would expect him to do equally well 
on the odd and even items (if odd and even items are matched 
for difficulty). If this is the case, the test is reliable, i.e., 
it is measuring consistently. If he does not do equally well,
the test is not measuring consistently and therefore is not
reliable. However, this Split-Half coefficient is not the
same thing as the coefficients obtained by the Alternate
Forms or Test-Retest methods. It measures internal con-
sistency and not consistency from administration to admin-
istration. Whenever the reliability coefficient is cited in
reports and articles, it is usually identified by the method 
used to obtain it. 


THE STANDARD ERROR OF 
MEASUREMENT 


When we review the results of a test, how much con- 
fidence can we place in our scores? In other words, how 
accurate are our scores? Suppose that Individual A made 
a score of 79. How much confidence do we have that our 
test is measuring accurately and that 79 is the individual's 
true score? By the true score we refer to the score that 
the test would give if there were no errors present in de-
termining his score. We may represent this by the formula:

$$ X_T = X - X_e $$

where X_T = "true" score
      X   = obtained score
      X_e = "error" score




It is obvious that the confidence we can place in a score 
depends on the gap between the obtained score and the true 
score. If the difference between the true and obtained score
is small, we can be quite confident that the obtained score 
is a good measure of the individual's performance. However, 
if the difference is large, i.e., there is a great discrepancy 
between the obtained and true scores, our test has given 
us a faulty measurement. 


Unfortunately, our test scores are obtained scores 
and we have no idea of just exactly what the true score for 
any individual might be. There is a way in which we can 
determine to a certain extent how much a score might devi-
ate from a true value. This is commonly called the Standard
Error of Measurement. The formula is

$$ SE = SD \sqrt{1 - r_t} $$

where
SD = standard deviation of the distribution
$r_t$ = reliability coefficient of the test.


We use the Standard Error of Measurement to deter- 
mine the range in which the true score of an individual 
probably lies. If the obtained score of an individual is
75 and the Standard Error of Measurement is 5, we can
say that two out of three times his obtained score does not
differ from his true score by more than ±5. In two out of three
times his true score would actually fall between 70 and 80.


Let us consider an example. Suppose that the SD of a 
test is 10 and the reliability coefficient is .84. By the 
above formula the Standard Error of Measurement is 


$$ SE = SD \sqrt{1 - r_t} = 10 \sqrt{1 - .84} = 10 \sqrt{.16} = 10(.4) = 4 $$




Thus the odds are two to one that the obtained score of
any individual does not differ from his true score by more than ±4.
If Individual A had a score of 79, we may feel reasonably confident
that his true score actually lies in the range from 75 to 83.
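
The same arithmetic can be sketched in Python. The function names standard_error_of_measurement and score_band are hypothetical names chosen for this illustration; the figures (SD of 10, reliability of .84, obtained score of 79) are those of the example above.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SE = SD * sqrt(1 - r_t), for a reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def score_band(obtained, sem):
    """Range of one Standard Error on either side of an obtained score."""
    return obtained - sem, obtained + sem


sem = standard_error_of_measurement(10, 0.84)
low, high = score_band(79, sem)
print("SE:", round(sem, 2), "band:", round(low, 2), "to", round(high, 2))
```

The band is the two-out-of-three range discussed in the text; it is not a guarantee that the true score lies inside it.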


The main purpose in testing is to separate individuals 
in respect to the trait that we are testing. You have prob-
ably inferred by now that two different scores made by two
individuals do not necessarily mean they are different in
respect to the trait being measured. Suppose that one
makes a score of 75 and another a score of 78, and the
Standard Error of Measurement is 4. The range for the
first individual is 75 ± 4, or 71 to 79, and for the second
individual 78 ± 4, or 74 to 82. Has our test done any sep-
arating? We cannot be sure when scores are close to-
gether within the limits of the Standard Error of Measure-
ment. If there is an overlap in the range, such as in the
above example, we cannot be sure that the true scores of 
the individuals are actually different. It is obvious that the 
smaller the Standard Error of Measurement the more ac- 
curate our obtained scores. 
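
The overlap check can be written out as a short Python sketch. The function name bands_overlap is hypothetical, and treating two ranges that merely touch as overlapping is our own assumption; the text does not settle that borderline case.

```python
def bands_overlap(score_a, score_b, sem):
    """True when the ranges score +/- SE for two individuals overlap.

    The two ranges overlap exactly when the obtained scores differ by
    no more than twice the Standard Error of Measurement.
    """
    return abs(score_a - score_b) <= 2 * sem


# The example from the text: 75 and 78 with a Standard Error of 4.
print(bands_overlap(75, 78, 4))

# A hypothetical pair far enough apart that the ranges do not meet.
print(bands_overlap(60, 78, 4))
```

When the function returns True, we cannot be sure that the two true scores differ.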


We should notice from the formula

$$ SE = SD \sqrt{1 - r_t} $$

that the reliability of the test is important in determining the
size of the Standard Error of Measurement. If $r_t$ is perfect,
or 1.0, the term under the radical reduces to zero, and the
Standard Error of Measurement is zero. If $r_t$ is zero,
the Standard Error becomes the same as the SD. We can
see from this that the higher the reliability of the test, the
smaller the Standard Error of Measurement.



VALIDITY 


As was mentioned earlier in this chapter, one of the 
necessary requirements for a good test is validity. By 
validity we mean that a test is testing what it is supposed 
to test. If we construct a test to measure mathematical
achievement and it turns out to be a better measure of ability in
cake-baking it has little validity. To be valid a test must 
serve the purpose for which it is intended. 


A test can be highly reliable, but not valid. In the 
somewhat exaggerated example above, our test may be 
highly reliable (i.e., give consistent results) but it certainly 
is not valid. It is not testing what it is supposed to test. 


We may determine the validity of a test by calculating 
the validity coefficient. To do this, we use our well known 
method of correlation between two tests and the correlation 
coefficient is our validity coefficient. 


What other tests do we use to determine our coefficient? 
The usual procedure is to choose some other test that is 
well known as a good test for the purpose for which our test 
is intended. This test is known as the criterion. If we con-
struct a test to determine intelligence, we want it to cor-
relate highly with a well-known test in this area. Our cri-
terion might be the Wechsler-Bellevue or the Stanford-
Binet. If the relationship is good, i.e., the validity coef-
ficient is high, our test must be testing the same thing as
the criterion.


It should be obvious by now that a test is worthless if 
it is not valid. If the validity of a test we have constructed
is low, we must consider improving it or discarding it al- 
together. 


Once the criterion is selected, it is a straightforward
procedure to determine a validity coefficient. We need only
to administer the two different tests to a group and find the
correlation coefficient between them. If the rela-
tionship is good, the coefficient is high, and our test is
highly valid. If the relationship is poor, the coefficient is
low, and our test is not valid. An illustration of the compu-
tation of the validity coefficient is shown in Table 22. Let
us say that Test X is a test that we constructed for deter-
mining ability in arithmetic. The criterion is Test Y, a
standardized test for arithmetic ability. To find the corre-
lation we use the Pearson product-moment method.


Table 22 
CALCULATION OF A VALIDITY COEFFICIENT 


Individual     X     Y     x     y    x²    y²    xy
    A-B        [illegible]
    C         17    25     4     4    16    16    16
    D-H        [illegible]
    I          9    17    -4    -4    16    16    16
    J          [illegible]
             ---   ---   ---   ---   ---   ---   ---
             130   210     0     0    62    88    69
SUMMARY

M_X = 13        Σx² = 62
M_Y = 21        Σy² = 88
                Σxy = 69

$$ r = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} = \frac{69}{\sqrt{(62)(88)}} = \frac{69}{\sqrt{5456}} = \frac{69}{74} = .93 \text{ (app.)} $$


Since the correlation is high, we will assume that our 
test is measuring the same thing as the criterion. 
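
Once the three sums of the summary are in hand, the coefficient follows in a single step. The short Python sketch below uses the sums reported for Table 22 (Σxy = 69, Σx² = 62, Σy² = 88); the function name r_from_deviation_sums is a hypothetical name chosen for this illustration.

```python
from math import sqrt


def r_from_deviation_sums(sum_xy, sum_x2, sum_y2):
    """Pearson r from the sums of deviation cross-products and squares."""
    if sum_x2 <= 0 or sum_y2 <= 0:
        raise ValueError("each test must show some spread in its scores")
    return sum_xy / sqrt(sum_x2 * sum_y2)


validity = r_from_deviation_sums(69, 62, 88)
print("validity coefficient:", round(validity, 2))
```

The guard against a zero sum of squares is there because a test on which every pupil makes the same score cannot be correlated with anything.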




In determining the validity, it is necessary that the 
reliability coefficient of both tests be high. This is a matter 
of common sense, since our validity coefficient is meaning-
less if either of the tests is not reliable. A validity coef-
ficient of .93 is very high. Usually validity coefficients
are of the order of .60 to .70.


How large must the validity coefficient be before we 
can infer that our test is valid? This will depend on the 
type of test. For a discussion of the different size validity 
coefficients for various tests the reader is referred to any 
standard text in statistics. 


THE USE OF STANDARDIZED TESTS 


The classroom teacher often finds himself faced with 
the administration of one or more standardized tests to his 
pupils at least once during the school year. It is usually 
difficult if not impossible for many schools to have the serv- 
ice of a trained test technician to handle the testing program. 
As a result, the administration and interpretation of the 
tests are left to the teacher. Some teachers begrudgingly
administer the test, and, because of their feelings of inad- 
equacy and lack of confidence in test practices, fail to make 
full use of the information given by the scores. On the other 
hand, some with a flair for testing will plunge into an inten- 
sive testing program and make all sorts of unqualified con- 
clusions and assumptions based on the test scores. 


With a growing emphasis on the use of standardized 
tests, the future teacher needs to have a middle of the road 
approach. He has to be confident in his testing and evalua- 
tion, but, at the same time, cautious and conservative in 
his use and interpretations. This, as with anything else, is 
established through practice and constant use. 


As was mentioned previously, a standardized test is one 
that has been administered to a selected sample, and norms 
have been constructed on the basis of these scores (review 
Chapter III). A teacher-made test is usually considered
non-standardized because it is written for the sole purpose
of discrimination of achievement for one class. However,
many standardized tests were at one time teacher-made 
tests, but with constant refinement were finally administered 
to a sample of students, norms were constructed, and the 
test was printed and put on the market. 


The reader will note in the preceding paragraph the 
repetition of the term sampling. This forms the important 
first rule in the use of standardized tests: the test that is
being used should have norms based on a sample that is
very much like the test group. Otherwise, the scores
made by the group will not be comparable when placed on 
the norms for the standardized test. The manual accompa- 
nying the test usually gives the information as to the nature 
of the sample. 


Of course, many tests will not be based on a sample 
that is exactly like the group to be tested. In this case the 
teacher must use a certain amount of discretion in inter- 
preting the scores. If the group to be tested is made up of 
pupils in a small high school in Minnesota and the test norms 
are based on a sample from large high schools in another 
state, certain allowances will have to be made.


The second rule concerns the purpose of the test. In 
most situations the school administration selects the test 
and the teacher has only the responsibility of giving the 
test. However, if the teacher has the freedom to select 
the test, the test must be one that is going to measure what 


The third rule incorporates many smaller ones. Fol- 
low the instruction manual carefully. The purpose of the 
manual is to make the test situation as similar as possible
to that of the original sample upon which the norms were
based. If the norms are to be accurate, the test situations 
must be similar. The test manual includes instructions on 
how to administer the test, instructions to the examinees, 
time limits for the various sections of the test, and direc- 
tions for plotting a profile chart of each student and the 
interpretation of these charts. Since each test is different 
these points will not be taken up individually. However, it 
cannot be overemphasized that the manual must be followed 
carefully. 


SUMMARY 


It must be remembered that standardized tests, like 
all others, are subject to certain limitations. Generally
speaking, the Standard Error of Measurement is
less for standardized tests than for teacher-made tests.
However, the interpretation of test scores must take into
account motivation, emotional level, and other factors that
influence test scores. It can be seen that a test score is
just a sample of an individual's performance. With repeated
testing, the average score on these repetitions would be
indicative of his performance in the long run, but a test is 
administered only once. His score may be too high, too 
low, or just right. As a result, a test score should not 
be considered absolute, but should be interpreted with ref- 
erence to other factors. For an excellent discussion refer 
to Lindquist in the appendix. 




Chapter VII 
SUMMARY AND CONCLUSIONS 


In the previous chapters, the reader has been introduced 
to a few of the basic statistical techniques needed to construct
and interpret tests. It would be a near-impossible task in 
a short manual to list all the factors that are necessary for 
the interpretation of test scores. However, if the reader 
has mastered the essentials set forth in the previous chap- 
ters, he is ready to begin the long and tedious process by 
which he becomes proficient in testing. Many of the tech- 
niques have been omitted in this manual because of its intro- 
ductory nature. These will be found in the suggested read- 
ings in the appendix. 


The word that stands out among all the rest in test inter- 
pretation is "practice". Constructing tests, analyzing items, 
readministering the same test, etc.; all are processes 
that become easier and more meaningful as time goes on. 
The tester will find that many of these processes will become 
second nature after a time and soon a certain "test sensitiv-
ity" will set in. This is the goal of tests and measurements
in general. 


One of the most common mistakes for the novice is to 
place too much confidence in a test score. There are many 
variables that affect a person's performance on a test. Fa-
tigue, attitude, and others may interfere with the perform-
ance to a greater or lesser degree. For a complete discus-
sion of these factors that widen the gap between the obtained
score and the true score, the reader is referred to the book
by Lindquist in the appendix. 


Every statistical concept that has been discussed is in- 
fluenced more or less by certain factors that are operating 
with or without the examiner's knowledge. Part of the in-
terpretation of the standard deviation depends on the shape
of the distribution curve; the Pearson product-moment r
depends on the linearity of the curve; etc. Every statistical 
concept has some limitations. A few of these have been 
mentioned in the course of the discussion, but a full treat- 
ment of these limitations is beyond the scope of this manual. 
Any of the references cited in the appendix will serve to 
give the reader the nature and importance of these limita- 
tions. 


If the reader has acquired the necessary statistical 
background to become interested in tests and test practices, 
this manual has served its purpose. This manual also 
might be handy in the future as a quick reference to some 
statistical technique or interpretation. With continued use, 
the reader can develop the sensitivity to testing that will be 
necessary in the present educational testing movement.


APPENDIX 


PROBLEMS AND ANSWERS 


SAMPLING PROCEDURES 


SUGGESTED READINGS AND REFERENCES 


I


PROBLEMS 
AND ANSWERS 


Note: Many of the following exercises include small samples 
to make the computations easier. As a result, the 
conclusions based on these results may not necessar- 
ily hold true. The questions based on the results, 
therefore, should be interpreted as referring to com-
putations from much larger samples. (See Introduc-
tion.)


PROBLEMS 




CHAPTER I 


1. The following are 115 scores on an English examination. 


54 52 51 49 49 48 47 47 47 45
44 44 44 43 43 43 43 43 42 42
41 40 39 39 39 39 38 38 38 38
37 37 37 36 36 36 36 36 35 35
35 34 34 34 33 33 33 33 33 32
32 32 32 32 32 32 31 31 31 31
31 31 30 30 30 30 30 30 29 29
29 29 29 28 28 28 28 28 28 28

From these scores, construct three frequency distribu-
tions.

a. Use six classes, with the first class 0-9, and the
   last 50-59.
b. Use 10 classes, with the first class 5-9, and the
   last 50-54.
c. Use 17 classes, with the first class 5-7, and the
   last 53-55.




Study the three distributions. Notice the loss of sensi- 
tivity when there are only six classes. Also observe 
the distribution with 17 classes. Notice there is a loss 
of sensitivity when too many classes are employed. A 
rule of thumb which is often helpful is to use from eight 
to 14 classes depending upon the number and the range 
of the scores. 
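
For readers who wish to check their tallies by machine, the grouping into classes can be sketched in Python. The function name frequency_distribution is hypothetical, and the ten scores used here are simply the ten highest of the English scores, taken for illustration only; the full list would be handled in the same way.

```python
def frequency_distribution(scores, lowest, width, n_classes):
    """Tally whole-number scores into equal-width classes.

    Returns a list of (lower_limit, upper_limit, frequency) tuples,
    starting with the class whose lower limit is `lowest`.
    """
    counts = [0] * n_classes
    for score in scores:
        index = (score - lowest) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"score {score} falls outside the classes")
        counts[index] += 1
    return [
        (lowest + i * width, lowest + (i + 1) * width - 1, count)
        for i, count in enumerate(counts)
    ]


# The ten highest English scores, used only as a small illustration.
top_ten = [54, 52, 51, 49, 49, 48, 47, 47, 47, 45]

six_classes = frequency_distribution(top_ten, lowest=0, width=10, n_classes=6)
for lower, upper, frequency in six_classes:
    print(f"{lower:2d}-{upper:2d}  {frequency}")
```

Calling the function with lowest=5, width=5, n_classes=10 or with lowest=5, width=3, n_classes=17 gives the other two groupings asked for in the problem.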


CHAPTER II 


1. The following data consist of 20 scores on a spelling
test.

(a) Calculate the mean. M =

(b) Check your answer by subtracting the mean from each
one of the scores. If your mean is correct the algebraic
sum of the deviations from the mean should equal zero.

 X
15
14
12
12
12
12
10

ΣX =


2. The following 20 scores were obtained on a physical
fitness test.

(a) Calculate:
    Mean
    Median
    Mode

(b) Which of the three is the poorest measure of central
tendency for these data?

 X
21
21
20
20
18
18
18


CHAPTER III


1. Using the data given on the preceding page of the 115
English scores, calculate the following percentiles:


Pos = 


Je 
| 


2. P50 is also the median. Why is this so?




3. John Brown is rated at the 10th Percentile on an
English proficiency test. Just what does this statement
mean?


4. When you see that a student is at the 50th Percentile on
national norms on the test described above, why do
you conclude that he is an "average" student?


CHAPTER IV 


1. Calculate the standard deviation for the following set
of scores by the two different methods described in
Chapter IV.


   Deviation Method             Whole Score Method
    X      x      x²              X      X²
   15                            15
   14                            14
   13                            13
   10                            10
   10                            10
    9                             9
    8                             8
    7                             7
    5                             5
    4                             4

   ΣX = 95    Σx² =              ΣX = 95    ΣX² =
   M  = 9.5                      M  = 9.5
   N  = 10                       N  = 10


a. $SD = \sqrt{\frac{\Sigma x^2}{N}}$ =

b. $SD = \sqrt{\frac{\Sigma X^2}{N} - M^2}$ =




c. Which method did you find the easier? Which 
method would be easier to use if the scores were 
small and the mean were a whole number?


2. A student writes examinations in a history course and
an English course. The following information is taken
from the two tests:


                   History    English
Student's score      62         80
Mean                 54         70
SD                    4         10


a. In terms of standard scores, on which test did
he do better?


3. The following are scores by 20 students on a short
quiz.

X: 13, 10, 10, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 5, 4, 3, 3, 2, 2

a. Calculate the
   Mean =
   Median =
   Mode =
   SD =

b. If this small sample is normally distributed, 68 per cent
   of the scores should fall between what two values?

   ______ and ______

USE THIS SPACE FOR COMPUTATIONS




CHAPTER V 


1. Calculate the correlation coefficient between an algebra
and a geometry test administered to 10 students using
the Rank-difference Method.


Student   Algebra   Geometry   Rank        Rank          D     D²
                               (algebra)   (geometry)
   A         97        84
   B         85        52
   C         74        93
   D         63       100
   E         52        65
   F         51        42
   G         49        51
   H         47        39
   I         31        38
   J         20        36

                                                     ΣD² =


a. $\rho = 1 - \frac{6 \Sigma D^2}{N(N^2 - 1)}$ =


b. Is there a high degree of relationship between
these two tests? 


2. Calculate the correlation coefficient for the scores on
two tests below, using the Pearson product-moment
Method.


Individual   Test I   Test II
               X         Y           x     y    x²    y²    xy
    A          19    [illegible]
    B          18        21
    C          15        25
    D          15        20
    E          12        21
    F          12        18
    G          11    [illegible]
    H          10        15
    I           9        14
    J           9    [illegible]

             ΣX =      ΣY =          0     0   Σx² =  Σy² =  Σxy =
             Mean =    Mean =


a. $r = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}}$ =


b. Do the two tests show a high degree of relationship? 


CHAPTER VI 


1. An arithmetic test was administered to ten pupils in
the seventh grade. Each pupil's score was separated 
to give number of odd and even items correct. Calcu- 
late the (a) reliability of the split-half forms of the test 
using the Pearson product-moment Method of correla- 
tion, and (b) the reliability of the total test using the 
Spearman-Brown Prophecy Formula. 


Individual   Odd        Even
              X          Y           x     y    x²    y²    xy
    A     [illegible] [illegible]
    B        14         12
    C        14         13
    D        13         15
    E        13         16
    F        12         12
    G        10          9
    H        10          8
    I         9          7
    J         9      [illegible]

            ΣX =       ΣY =          0     0   Σx² =  Σy² =  Σxy =
            Mean =     Mean =


a. $r_{oe} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}}$ =

b. $r_t = \frac{2 r_{oe}}{1 + r_{oe}}$ =


c. Is this test consistent in its results?




2. You have constructed a test to measure fifth grade 
arithmetic achievement, and would like to know how 
your test compares with the Woody Arithmetic Test. 
To determine this, you administer both tests to ten 
fifth graders. Compute the validity coefficient by 
means of the Pearson product-moment Method. 


Individual   Your Test   Woody
                 X         Y         x     y    x²    y²    xy
    A           40        16
    B           39        17
    C           37        15
    D           36        16
    E           36        10
    F           35        12
    G           34         9
    H           34         9
    I           30         8
    J           29         8

              ΣX =       ΣY =        0     0   Σx² =  Σy² =  Σxy =
              Mean =     Mean =


a. $r = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}}$ =

b. Do the results of your arithmetic test agree with
the results of the Woody?


ANSWERS 


Chapter II 


2. a. Mean   = 13
      Median = 14.5
      Mode   = 18


b. The mode is the poorest because the score most 
frequently made in this test is at the extreme end 
of the distribution. 

Chapter III 
1. Py = 26 


Pg = 44 


Pg - 31 


2. Because the median is defined as the point that is in
the exact center of the distribution, 50 per cent of the
scores lie below that point, which has the same mean-
ing as P50.


3. Only 10 per cent of the students scored lower than
John Brown.


4. If we define average as someone scoring in the exact
center of the distribution (the median is a measure of
an "average" score), then P50 meets these requirements.


Chapter IV 




b. $SD = \sqrt{\frac{1025}{10} - (9.5)^2} = 3.5$
c. The whole score method might possibly have been 
easier because the deviation method required both 
subtraction and squaring while the whole score 
method needed only squaring. The deviation meth- 
od is convenient when scores are small and the 
mean is a whole number due to the small size of 
the numbers after subtraction. 
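
That the two methods agree can be verified with a brief Python sketch. The function names are our own, and the ten scores are an assumption: they are the Chapter IV problem scores as we read them, with the two least legible values (7 and 4) inferred from ΣX = 95 and ΣX² = 1025. Both functions divide by N, as the answer to part b does.

```python
from math import sqrt


def sd_deviation_method(scores):
    """SD from deviations about the mean: sqrt(sum(x^2) / N)."""
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum((score - mean) ** 2 for score in scores) / n)


def sd_whole_score_method(scores):
    """SD from raw scores: sqrt(sum(X^2) / N - M^2)."""
    n = len(scores)
    mean = sum(scores) / n
    return sqrt(sum(score * score for score in scores) / n - mean ** 2)


# Assumed reading of the ten problem scores (see the note above).
scores = [15, 14, 13, 10, 10, 9, 8, 7, 5, 4]

print(sd_deviation_method(scores), sd_whole_score_method(scores))
```

The two results should agree except for small rounding differences in the machine arithmetic.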
2. a. Standard Score on History Test: $\frac{62 - 54}{4} = 2$
      Standard Score on English Test: $\frac{80 - 70}{10} = 1$
In terms of the rest of the class, the individual did 
better on the history test. 
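
The comparison may be sketched in Python as follows. The function name standard_score is hypothetical, and the English mean of 70 is our reading of the problem, chosen because it yields the standard score of 1 given in the answer.

```python
def standard_score(score, mean, sd):
    """Distance of a score from the mean, in standard deviation units."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (score - mean) / sd


history = standard_score(62, mean=54, sd=4)
english = standard_score(80, mean=70, sd=10)

better = "history" if history > english else "English"
print("history:", history, "English:", english, "better on:", better)
```

Although the raw English score is the higher of the two, the history score lies farther above its own class mean.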
3. a. Mean   = 7
      Mode   = 8
      Median = 8
      SD     = 2.9
b. 68 per cent of the scores in a normal distribution 
should fall between +1 and -1 standard deviations 
from the mean or between 4.1 and 9.9. 
Chapter V
1. a. $\rho = 1 - \frac{6 \Sigma D^2}{N(N^2 - 1)} = 1 - \frac{6(26)}{10(100 - 1)} = 1 - \frac{156}{990} = .84$
b. There appears to be a high degree of relationship
because a coefficient of .84 would indicate that
those that scored high on one test scored high on the
other, and those that scored low on one scored low
on the other.


2. a. $r = \frac{\Sigma xy}{\sqrt{(116)(118)}} = \frac{\Sigma xy}{\sqrt{13{,}688}} = \frac{\Sigma xy}{117} = .91$
b. Yes, a coefficient of .91 shows that the relation-
ship is very good.
Chapter VI 
1. a. $r_{oe} = \frac{\Sigma xy}{\sqrt{(\Sigma x^2)(\Sigma y^2)}} = .80$
b. $r_t = \frac{2 r_{oe}}{1 + r_{oe}} = \frac{2(.80)}{1 + .80} = \frac{1.60}{1.80} = .89$
c. With an $r_t$ of .89, this test gives fairly consistent
results.
2. a. $r = \frac{98}{\sqrt{(111)(120)}} = \frac{98}{\sqrt{13{,}320}} = \frac{98}{115.4} = .85$
b. With a validity coefficient this high the results of
both tests agree very well, and both tests appear
to be measuring the same thing.




II 
SAMPLING PROCEDURES 


It is of interest to see just how norms for the various 
standardized tests are obtained. In Chapter III a college 
entrance examination was administered to a sample of 1000
students and these scores and accompanying percentiles made 
up the norms. How was this sample of 1000 students chosen? 


It is impossible to include all college freshmen in the 
sample. The size of the sample is arbitrary. However, 
the larger the sample (all other things being equal) the smaller
is the chance of severe errors in the sampling procedure.
The choice of the size of sample is usually determined by
the amount of time and money that can be spent on standard-
izing these tests. Just how are these samples obtained?


RANDOM SAMPLING 


In a random sample, every individual in the population 
has an equal chance of being included in the sample. Popula-
tion does not necessarily mean every person in the U. S.
The population depends on just whom the test is being con-
structed for. If the test is to be general and might be used
on any member of the U. S., then the population refers to all
of the inhabitants of the U. S. However, if the test is to be
used on bakery employees, then the population would consist
of all bakery employees in the U. S., and the sample would
be picked at random from them.


Population, then, might be all of the garage attendants
in the U. S., all of the government employees in a state, or 
all of the freshmen in one college. The population depends 
on the future use of the test. Consider an example which 
might clarify this last point. 




Suppose that a test is being constructed that will be 
given to girls employed in state jobs as clerical help. Who 
should comprise the population? Since this is for a particu- 
lar state there would be no reason to standardize the test on 
the basis of all clerical help in the nation, so the population 
at least would be confined to one state. Since the test is of 
a clerical nature intended for state employees, the popula- 
tion would be all clerical help hired by the state. The 
population may be further restricted by intending the test 
for typists only, or for bookkeepers only. 


The sample must come from this population for which 
the test is being constructed. In random sampling, each 
person in the population (in this case, every state-employed
typist or bookkeeper) must have an equal chance to be in-
cluded in the sample. This could be done in a number of
ways. One procedure involves a list of these individuals in
alphabetical order, and every sixth or tenth one in the list is
chosen to make up the sample. Or a number may be assigned
to every individual, and with the help of a table of random
numbers the girls whose assigned numbers come up are
chosen. This table of random numbers, which is
given in most statistics texts, is a set of numbers chosen 
mechanically and each number has an equal chance of ap- 
pearing. A table of random numbers and directions on how 
to use them is given in the book by Edwards cited in the 
references. 


But why use a table of random numbers? Couldn't the 
test constructors simply pick these numbers from the list? 
The answer is an unequivocal no! A surprising amount of 
experimenter bias exists in a situation such as this. One 
experiment performed to show that experimenter bias exists 
gave evidence that many people have favorite numbers. If 
an individual's favorite numbers should happen to be 5, 7, 
and 9, and he attempted to pick a "random" sample from a 
list, a majority of 5, 7, and 9's would show up. The best 
way to avoid this is to use a table of random numbers. 
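
Where a computer is at hand, a pseudo-random number generator may stand in for the printed table of random numbers. The sketch below is an illustration under stated assumptions, not the procedure of any particular test constructor: the roster of 200 numbered employees is hypothetical, as are the function name draw_random_sample and the seed value.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Draw sample_size members without replacement.

    Every member of the population has the same chance of selection.
    A fixed seed makes the draw repeatable, much as recording the
    starting point in a printed table of random numbers would.
    """
    rng = random.Random(seed)
    return rng.sample(population, sample_size)


# Hypothetical roster: 200 state-employed typists, identified by number.
roster = [f"employee_{number:03d}" for number in range(1, 201)]

chosen = draw_random_sample(roster, sample_size=20, seed=1955)
print(chosen)
```

Because the machine and not the experimenter picks the numbers, favorite numbers have no opportunity to creep into the sample.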


Why is it necessary to have a random sample? For 
one thing, the sample must be a representative sample of
the population. Certain characteristics of some people may
tend to give them a higher score than they should actually have.
Other characteristics may tend to give a lower score. If it 
is assumed that these characteristics are distributed in the 
population according to chance factors, choosing the samples 
by a table of random numbers should include an equal rep- 
resentation of people with these different characteristics. 
Sampling theory makes extremely interesting reading, and 
the reader is referred to any statistics book for a fuller dis-
cussion of it.


STRATIFIED OR QUOTA SAMPLING 


Random sampling sometimes does not give a repre- 
sentative sample. Why not? When there are different sub- 
groups or strata (thus the term stratified), the sample must 
contain individuals drawn from each stratum in accordance 
with the sizes of the sub-groups. This type of sampling de-
pends on the population for which the test is intended. In
the preceding section random selection sufficed, because 
the test was for state employed clerical help, per se, and 
supposedly had no factors in it that were influenced by reli- 
gion, income, and so forth. But suppose that a test is being 
constructed which will be given eventually to any child from 
the ages of six to twelve. Then norms must be based on 
all children from the ages of six to twelve. Also suppose 
that the variables in the test are affected in part by the eco- 
nomic level of the home. This was the case of the Stanford- 
Binet intelligence test for children. This test was standard- 
ized on the basis of approximately 3, 000 children. 


To make sure that the selection of children was
representative of the population, the occupational levels
of the fathers of the children in the standard group were
checked against six occupational classifications of males
based on the U.S. Census of 1930. The six occupational
groups listed were professionals, semi-professionals,
business men, farmers, skilled laborers, and unskilled or
slightly skilled laborers. Since there were differing
proportions of men in these groups, the per cent of children
in the standard group with fathers of a certain occupation
had to match that of the general population. For example, if
five per cent of the employed males were in the professional
group, then five per cent of the children in the standard
group must have fathers in the professional group.
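
The arithmetic of matching the sample to the population can be
sketched as follows. The five per cent figure for professionals
is the example used above; the percentages for the other five
groups are invented for illustration and are not the 1930 Census
figures. The function name proportional_quotas is likewise
hypothetical.

```python
def proportional_quotas(percent_by_stratum, sample_size):
    """Split sample_size among strata in proportion to whole-number percentages."""
    if sum(percent_by_stratum.values()) != 100:
        raise ValueError("percentages must sum to 100")
    quotas, remainders = {}, {}
    for stratum, pct in percent_by_stratum.items():
        # Integer arithmetic: whole persons, plus the leftover hundredths.
        quotas[stratum], remainders[stratum] = divmod(pct * sample_size, 100)
    # Hand any persons lost to rounding down to the largest remainders.
    leftover = sample_size - sum(quotas.values())
    for stratum in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[stratum] += 1
    return quotas


# Hypothetical percentages of employed males in each group.
percent_by_stratum = {
    "professional": 5,        # the example figure from the text
    "semi-professional": 10,  # invented
    "business": 15,           # invented
    "farmer": 20,             # invented
    "skilled": 25,            # invented
    "unskilled": 25,          # invented
}
quotas = proportional_quotas(percent_by_stratum, 3000)
print(quotas["professional"])  # 150 children, five per cent of 3,000
```

Within each group the children would still be chosen at random;
stratification fixes only how many are taken from each group.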


So it can be seen that stratified sampling must sometimes
be used if the sample is to be truly representative of the
population for which the test is intended. As was mentioned
in the previous section on random sampling, the reader is
referred to any statistics text for a more thorough
discussion of stratified sampling.


III


SUGGESTED READINGS AND 
REFERENCES 


Edwards, A. L. Statistical Methods for the Behavioral
Sciences. New York: Rinehart and Company, Inc.,
1955.

Garrett, H. E. Statistics in Psychology and Education
(4th ed.). New York: Longmans, Green, and Co.,
1953.

Lindquist, E. F. (Editor) Educational Measurement.
Washington, D. C.: American Council on Education,
1951.

Ross, C. C. and Stanley, J. C. Measurement in Today's
Schools (3rd ed.). New York: Prentice-Hall, Inc.,
1954.

Tinker, M. A. Introduction to Methods in Experimental
Psychology (2nd ed.). New York:
Appleton-Century-Crofts, Inc., 1947.




